RoiRL: Efficient, Self-Supervised Reasoning with Offline Iterative Reinforcement Learning
By: Aleksei Arzhantsev, Otmane Sakhi, Flavian Vasile
Potential Business Impact:
Makes AI smarter without needing constant human help.
Reinforcement learning (RL) is central to improving reasoning in large language models (LLMs) but typically requires ground-truth rewards. Test-Time Reinforcement Learning (TTRL) removes this need by using majority-vote rewards, but relies on heavy online RL and incurs substantial computational cost. We propose RoiRL: Reasoning with offline iterative Reinforcement Learning, a family of lightweight offline learning alternatives that can target the same regularized optimal policies. Unlike TTRL, RoiRL eliminates the need to maintain a reference model and instead optimizes weighted log-likelihood objectives, enabling stable training with significantly lower memory and compute requirements. Experimental results show that RoiRL trains up to 2.5x faster and consistently outperforms TTRL on reasoning benchmarks, establishing a scalable path to self-improving LLMs without labels.
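To make the two ingredients in the abstract concrete, here is a minimal sketch in PyTorch of a majority-vote reward combined with a reward-weighted log-likelihood loss over sampled completions. The function names, the binary 0/1 reward, and the sum-normalized weighting are illustrative assumptions, not the paper's exact objective.

import torch
import torch.nn.functional as F
from collections import Counter

def majority_vote_rewards(answers):
    # Reward 1 for samples whose final answer matches the majority answer, else 0.
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def weighted_loglik_loss(logits, token_ids, rewards, pad_id=0):
    # logits:    (batch, seq_len, vocab) model outputs for each sampled completion
    # token_ids: (batch, seq_len) tokens of the sampled completions
    # rewards:   (batch,) per-sample weights, e.g. majority-vote rewards
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    mask = (token_ids != pad_id).float()                               # ignore padding
    seq_logp = (token_logp * mask).sum(dim=-1)                         # (batch,)
    weights = torch.tensor(rewards, dtype=seq_logp.dtype, device=seq_logp.device)
    # Weighted negative log-likelihood: no reference model or KL term is needed.
    return -(weights * seq_logp).sum() / weights.sum().clamp_min(1.0)

In an offline iterative loop of this kind, one would sample completions from the current model, score them with majority_vote_rewards, minimize weighted_loglik_loss on the stored samples, and repeat; the exact iteration scheme used by RoiRL may differ.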
Similar Papers
TTRL: Test-Time Reinforcement Learning
Computation and Language
Teaches AI to learn from its own mistakes.
Reinforcement Learning Teachers of Test Time Scaling
Machine Learning (CS)
Teaches computers to explain answers better.
Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error
Machine Learning (CS)
Teaches computers to learn better from mistakes.