Reinforcement Learning Teachers of Test Time Scaling
By: Edoardo Cetin, Tianyu Zhao, Yujin Tang
Potential Business Impact:
Teaches computers to explain answers better.
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
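To make the training signal concrete, below is a minimal sketch of a student-based reward in the spirit of the abstract: the teacher's explanation is scored by how likely the student model finds the ground-truth solution once that explanation is placed in its context. This is an illustrative reconstruction, not the paper's exact reward formulation; the prompt template, model checkpoint, and function names are assumptions.

    # Illustrative sketch (assumed prompt format and student model; not the paper's exact reward):
    # score a teacher explanation by the average log-probability the student model assigns
    # to the ground-truth solution tokens, conditioned on the question and the explanation.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def student_solution_logprob(student, tokenizer, question, explanation, solution):
        # Assumed prompt template; the actual student formatting may differ.
        context = f"Question: {question}\nExplanation: {explanation}\nSolution: "
        ctx_ids = tokenizer(context, return_tensors="pt").input_ids
        sol_ids = tokenizer(solution, add_special_tokens=False, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, sol_ids], dim=1)

        with torch.no_grad():
            logits = student(input_ids).logits  # (1, seq_len, vocab_size)
        log_probs = torch.log_softmax(logits, dim=-1)

        # Logits at position i-1 predict the token at position i, so the solution tokens
        # (starting at sol_start) are scored by positions sol_start-1 ... seq_len-2.
        sol_start = ctx_ids.shape[1]
        preds = log_probs[0, sol_start - 1 : input_ids.shape[1] - 1, :]
        targets = input_ids[0, sol_start:]
        per_token = preds.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
        # Higher value = the explanation makes the solution more predictable to the student.
        return per_token.mean().item()

    # Example usage with an assumed open student checkpoint:
    # tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
    # student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
    # reward = student_solution_logprob(student, tokenizer, question, explanation, solution)

Because the reward is computed from the student's token-level likelihoods rather than a single pass/fail correctness check, it stays dense and informative even when the problem is too hard for the teacher to solve from scratch, which is the exploration issue the framework is designed to sidestep.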
Similar Papers
Training Language Models to Reason Efficiently
Machine Learning (CS)
Makes smart computer programs think faster, cheaper.
Interleaved Reasoning for Large Language Models via Reinforcement Learning
Computation and Language
Makes smart computers answer questions faster.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Artificial Intelligence
Makes computers learn new tricks, but not really.