RPO: Reinforcement Fine-Tuning with Partial Reasoning Optimization
By: Hongzhu Yi, Xinming Wang, Zhenghao Zhang, and more
Potential Business Impact:
Makes AI much faster and cheaper to train.
Within the domain of large language models, reinforcement fine-tuning algorithms require generating a complete reasoning trajectory from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze how different segments of the reasoning path affect the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating only suffixes of the reasoning path, drawing on an experience cache. RPO reduces token generation during the rollout phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO cuts the training time of a 1.5B model by 90% and of a 7B model by 72%. It can also be integrated with typical algorithms such as GRPO and DAPO, accelerating their training while maintaining performance comparable to the original algorithms. Our code is open-sourced at https://github.com/yhz5613813/RPO.
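The core idea described in the abstract, reusing a cached reasoning prefix so the policy only generates the suffix during rollout, can be illustrated with a minimal sketch. Note that all names here (ExperienceCache, sample_prefix, the policy.generate API) are hypothetical illustrations and are not taken from the RPO repository; this is a sketch of the general mechanism under stated assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a partial-reasoning rollout (not the official RPO code).
# Assumption: a cached trajectory's prefix can be reused, so the current policy
# only has to generate the final ~5% of tokens during the rollout phase.

import random
from dataclasses import dataclass, field


@dataclass
class ExperienceCache:
    """Stores previously generated reasoning trajectories per query."""
    trajectories: dict = field(default_factory=dict)  # query -> list of token lists

    def add(self, query, tokens):
        self.trajectories.setdefault(query, []).append(tokens)

    def sample_prefix(self, query, keep_ratio=0.95):
        """Return a prefix covering ~keep_ratio of a cached trajectory, or None if empty."""
        paths = self.trajectories.get(query)
        if not paths:
            return None
        path = random.choice(paths)
        cut = max(1, int(len(path) * keep_ratio))
        return path[:cut]


def partial_rollout(policy, query, cache, keep_ratio=0.95):
    """Produce a full reasoning path, reusing a cached prefix when one exists."""
    prefix = cache.sample_prefix(query, keep_ratio)
    if prefix is None:
        # Cold start: no cached experience yet, so fall back to a full rollout.
        full = policy.generate(query)                   # hypothetical policy API
        cache.add(query, full)
        return full, full                               # (trajectory, newly generated tokens)
    # Warm path: only the suffix is sampled from the current policy.
    suffix = policy.generate(query, prefix=prefix)      # hypothetical policy API
    trajectory = prefix + suffix
    cache.add(query, trajectory)
    return trajectory, suffix                           # suffix is the freshly generated part
```

In a GRPO- or DAPO-style training loop, the reward would presumably still be computed on the completed trajectory, while only the suffix tokens are newly sampled from the current policy, which is where the reported ~95% reduction in rollout token generation would come from.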
Similar Papers
R$^2$PO: Decoupling Training Trajectories from Inference Responses for LLM Reasoning
Machine Learning (CS)
Makes AI smarter by training it better.
G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance
Artificial Intelligence
Helps small AI learn to think better.
Effective Reinforcement Learning for Reasoning in Language Models
Artificial Intelligence
Teaches computers to think better and faster.