Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
By: Yiliu Sun, Zicheng Zhao, Yang Wei, and more
Potential Business Impact:
Teaches computers to think better from the start.
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically train on all generated tokens but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established theory of Path Dependence in human thinking, where early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning, termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix of the LLM's reasoning process. This targeted optimization positively influences subsequent reasoning and ultimately improves final results. To help LLMs learn how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; and (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for a single prefix token sequence and accumulating their scores as the reward signal. Extensive experiments on various reasoning tasks demonstrate that PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% while using only 26.17% of the training tokens.
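To make the two training strategies more concrete, below is a minimal Python sketch of how a prefix-retention schedule and a continuation-accumulated reward could be wired together. The function names, the linear retention schedule, the number of sampled continuations, and the toy verifier are all illustrative assumptions for this sketch, not the authors' implementation.

```python
import random

# Hypothetical sketch of the two PPPO training strategies described in the
# abstract. Names and schedules are assumptions, not the paper's code.

def prefix_retention_ratio(step: int, total_steps: int,
                           start: float = 0.1, end: float = 1.0) -> float:
    """Progressive Prefix Retention (assumed linear schedule): grow the
    fraction of prefix tokens that receive policy updates as training
    proceeds."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def continuation_accumulated_reward(prefix: list[int],
                                    sample_continuation,
                                    verify,
                                    num_continuations: int = 4) -> float:
    """Continuation Accumulated Reward: score one prefix by sampling several
    continuations and accumulating their verifiable rewards, instead of
    judging the prefix from a single rollout."""
    total = 0.0
    for _ in range(num_continuations):
        completion = sample_continuation(prefix)   # rollout continuing the prefix
        total += verify(prefix + completion)       # 0/1 verifiable reward
    return total / num_continuations


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    sample_continuation = lambda prefix: [random.randint(0, 9) for _ in range(5)]
    verify = lambda tokens: float(sum(tokens) % 2 == 0)  # dummy verifier

    total_steps = 1000
    for step in (0, 500, 1000):
        ratio = prefix_retention_ratio(step, total_steps)
        reward = continuation_accumulated_reward([1, 2, 3],
                                                 sample_continuation, verify)
        print(f"step={step:4d}  retained_prefix_ratio={ratio:.2f}  "
              f"prefix_reward={reward:.2f}")
```

In a real RLVR loop, `prefix_retention_ratio` would determine which generated tokens are kept in the policy-gradient objective, and `continuation_accumulated_reward` would replace the single-rollout reward for the retained prefix; both hooks are shown here only as a shape for the idea.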
Similar Papers
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Computation and Language
Makes AI smarter with less computer power.
CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment
Machine Learning (CS)
Boosts AI thinking with step-by-step feedback.
Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models
Computation and Language
Teaches computers to solve math problems better.