Well Begun, Half Done: Reinforcement Learning with Prefix Optimization for LLM Reasoning
By: Yiliu Sun, Zicheng Zhao, Yang Wei, and more
Potential Business Impact:
Teaches computers to think better from the start.
Reinforcement Learning with Verifiable Rewards (RLVR) significantly enhances the reasoning capability of Large Language Models (LLMs). Current RLVR approaches typically train on all generated tokens but neglect to explore which tokens (e.g., prefix tokens) actually contribute to reasoning. This uniform training strategy spends substantial effort optimizing low-return tokens, which in turn impedes the potential improvement from high-return tokens and reduces overall training effectiveness. To address this issue, we propose a novel RLVR approach called Progressive Prefix-token Policy Optimization (PPPO), which highlights the significance of the prefix segment of generated outputs. Specifically, inspired by the well-established theory of Path Dependence in human thinking, where early-stage thoughts substantially constrain the subsequent thinking trajectory, we identify an analogous phenomenon in LLM reasoning, termed the Beginning Lock-in Effect (BLE). PPPO leverages this finding by focusing its optimization objective on the prefix of the LLM's reasoning process. This targeted optimization positively influences subsequent reasoning and ultimately improves final results. To help LLMs learn how to start reasoning with high quality, PPPO introduces two training strategies: (a) Progressive Prefix Retention, which shapes a progressive learning process by increasing the proportion of retained prefix tokens during training; and (b) Continuation Accumulated Reward, which mitigates reward bias by sampling multiple continuations for a single prefix token sequence and accumulating their scores as the reward signal. Extensive experiments on various reasoning tasks demonstrate that PPPO outperforms representative RLVR methods, with accuracy improvements of 18.02% while using only 26.17% of the training tokens.
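To make the two training strategies more concrete, below is a minimal Python sketch of how a prefix-retention schedule and a continuation-accumulated reward could be wired together. The function names, the linear retention schedule, the number of sampled continuations, and the toy verifier are all illustrative assumptions for this sketch, not the authors' implementation.

```python
import random

# Hypothetical sketch of the two PPPO training strategies described in the
# abstract. Names and schedules are assumptions, not the paper's code.

def prefix_retention_ratio(step: int, total_steps: int,
                           start: float = 0.1, end: float = 1.0) -> float:
    """Progressive Prefix Retention (assumed linear schedule): grow the
    fraction of prefix tokens that receive policy updates as training
    proceeds."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + (end - start) * frac


def continuation_accumulated_reward(prefix: list[int],
                                    sample_continuation,
                                    verify,
                                    num_continuations: int = 4) -> float:
    """Continuation Accumulated Reward: score one prefix by sampling several
    continuations and accumulating their verifiable rewards, instead of
    judging the prefix from a single rollout."""
    total = 0.0
    for _ in range(num_continuations):
        completion = sample_continuation(prefix)   # rollout continuing the prefix
        total += verify(prefix + completion)       # 0/1 verifiable reward
    return total / num_continuations


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    sample_continuation = lambda prefix: [random.randint(0, 9) for _ in range(5)]
    verify = lambda tokens: float(sum(tokens) % 2 == 0)  # dummy verifier

    total_steps = 1000
    for step in (0, 500, 1000):
        ratio = prefix_retention_ratio(step, total_steps)
        reward = continuation_accumulated_reward([1, 2, 3],
                                                 sample_continuation, verify)
        print(f"step={step:4d}  retained_prefix_ratio={ratio:.2f}  "
              f"prefix_reward={reward:.2f}")
```

In a real RLVR loop, `prefix_retention_ratio` would determine which generated tokens are kept in the policy-gradient objective, and `continuation_accumulated_reward` would replace the single-rollout reward for the retained prefix; both hooks are shown here only as a shape for the idea.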
Similar Papers
Enhancing LLM Reasoning with Iterative DPO: A Comprehensive Empirical Investigation
Computation and Language
Makes AI smarter with less computer power.
CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment
Machine Learning (CS)
Boosts AI thinking with step-by-step feedback.
Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models
Computation and Language
Teaches computers to solve math problems better.