Boosting Accuracy and Efficiency of Budget Forcing in LLMs via Reinforcement Learning for Mathematical Reasoning
By: Ravindra Aribowo Tarunokusumo, Rafael Fernandes Cunha
Potential Business Impact:
Makes AI better at math while using fewer words.
Test-time scaling methods have seen a rapid increase in popularity for their computational efficiency and parameter-independent training as a way to improve the reasoning performance of Large Language Models. One such method is budget forcing, a decoding intervention strategy that allocates extra compute budget for thinking and elicits the model's inherent self-correcting behavior. However, this relies on supervised fine-tuning (SFT) on long-context reasoning traces, which causes performance degradation on smaller models due to verbose responses. For this reason, we offer a framework integrating reinforcement learning (RL) to improve token efficiency and boost the performance of a 1.5B model for mathematical reasoning. Using only 1.5K training samples, we find that our SFT+RL model performs better on the GSM8K dataset across varying compute budgets. Our main findings show overall higher accuracy while reducing token usage by over 40% compared to the SFT model, revealing how RL can recover the losses incurred by long-context training while improving performance in mathematical reasoning.
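To make the decoding intervention concrete, the sketch below illustrates budget forcing as the abstract describes it: the model's own stopping decision is overridden so that extra thinking budget is spent before the reasoning trace is allowed to close. This is a minimal illustration, not the paper's implementation; the delimiter string, the "Wait" continuation cue, and the `step` callback are assumptions standing in for a real tokenizer and model.

```python
from typing import Callable, List

# Illustrative stand-ins (assumptions, not taken from the paper):
END_OF_THINKING = "</think>"   # delimiter that closes the reasoning trace
CONTINUE_CUE = "Wait"          # cue injected to elicit self-correction

def budget_forced_decode(
    step: Callable[[List[str]], str],  # returns the next token given the context
    prompt: List[str],
    min_thinking_tokens: int,
    max_thinking_tokens: int,
) -> List[str]:
    """Sketch of budget forcing as a decoding-time intervention.

    Two overrides on the model's stopping behavior:
    - if it tries to stop thinking before the minimum budget is spent,
      the end-of-thinking token is suppressed and a continuation cue
      is appended instead;
    - if it thinks past the maximum budget, the end-of-thinking token
      is forced so the model wraps up.
    """
    tokens = list(prompt)
    n_thinking = 0
    while True:
        next_token = step(tokens)
        if next_token == END_OF_THINKING:
            if n_thinking < min_thinking_tokens:
                # Stopped too early: spend more budget on reasoning.
                tokens.append(CONTINUE_CUE)
                n_thinking += 1
                continue
            tokens.append(END_OF_THINKING)
            break
        tokens.append(next_token)
        n_thinking += 1
        if n_thinking >= max_thinking_tokens:
            # Budget exhausted: force the trace to close.
            tokens.append(END_OF_THINKING)
            break
    return tokens
```

Under this framing, the RL stage described in the abstract would reward shorter correct traces, so the model learns to answer well within smaller budgets rather than padding its reasoning to fill them.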
Similar Papers
BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens
Machine Learning (CS)
Lets smart computer programs think faster, cheaper.
Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners
Computation and Language
Makes AI smarter by learning from mistakes.
Learning to Reason Efficiently with Discounted Reinforcement Learning
Machine Learning (CS)
Makes AI think shorter, faster, and just as smart.