Stable Reinforcement Learning for Efficient Reasoning
By: Muzhi Dai, Shixuan Liu, Qingyi Si
Potential Business Impact:
Makes AI think smarter, faster, and more accurately.
The success of DeepSeek-R1 has drawn the LLM community's attention to reinforcement learning (RL) methods such as GRPO. However, these rule-based 0/1 outcome rewards cannot regulate the intermediate reasoning process during chain-of-thought (CoT) generation, which leads to severe overthinking. In response, recent studies have designed reward functions that reinforce shorter yet correct completions. Nevertheless, we observe that these length-penalty reward functions exacerbate RL training instability: as completion length decreases, model accuracy collapses abruptly, often early in training. To address this issue, we propose GRPO-$\lambda$, a simple yet effective variant of GRPO that is both efficient and stabilized. It dynamically adjusts the reward strategy by monitoring the correctness ratio among the completions in each query-sampled group. A low correctness ratio signals that a length penalty would compromise CoT quality, so training switches to length-agnostic 0/1 rewards that prioritize reasoning capability. A high ratio keeps the length penalty in place to boost efficiency. Experimental results show that our approach avoids the training instability caused by length penalties while maintaining the optimal accuracy-efficiency trade-off. On the GSM8K, GPQA, MATH-500, AMC 2023, and AIME 2024 benchmarks, it improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
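The switching rule described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the threshold or the form of the length penalty, so everything concrete here is an assumption made for illustration: the function name group_rewards, the threshold of 0.5, the penalty weight alpha, and the linear penalty over the lengths of correct completions are hypothetical and should not be read as the authors' implementation.

```python
from typing import List


def group_rewards(
    correct: List[bool],
    lengths: List[int],
    threshold: float = 0.5,  # assumed switching point, not from the paper
    alpha: float = 0.2,      # assumed maximum length penalty, not from the paper
) -> List[float]:
    """Assign rewards to one group of completions sampled for a single query.

    correct[i] says whether completion i reached the right answer,
    lengths[i] is its token count.
    """
    if not correct or len(correct) != len(lengths):
        raise ValueError("correct and lengths must be non-empty and equally long")

    ratio = sum(correct) / len(correct)

    # Low correctness ratio: fall back to length-agnostic 0/1 rewards so the
    # length penalty cannot erode reasoning quality.
    if ratio < threshold or not any(correct):
        return [1.0 if c else 0.0 for c in correct]

    # High correctness ratio: keep the 0/1 outcome reward but discount longer
    # correct completions (illustrative linear penalty).
    correct_lengths = [n for c, n in zip(correct, lengths) if c]
    shortest, longest = min(correct_lengths), max(correct_lengths)
    span = longest - shortest

    rewards = []
    for c, n in zip(correct, lengths):
        if not c:
            rewards.append(0.0)
        elif span == 0:
            rewards.append(1.0)
        else:
            rewards.append(1.0 - alpha * (n - shortest) / span)
    return rewards


if __name__ == "__main__":
    # One of four completions is correct: the group gets plain 0/1 rewards.
    print(group_rewards([True, False, False, False], [100, 200, 300, 400]))
    # Three of four are correct: shorter correct completions score higher.
    print(group_rewards([True, True, True, False], [100, 200, 300, 50]))
```

In this sketch the decision is made per group, so a hard query whose group is mostly wrong is scored on correctness alone, while an easy query whose group is mostly right is also scored on brevity.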
Similar Papers
S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
Artificial Intelligence
Makes AI think less, answer smarter.
Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning
Computation and Language
Makes AI think smarter, not longer.
Reinforcing Video Reasoning with Focused Thinking
CV and Pattern Recognition
Helps computers understand videos by focusing on important parts.