Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions

Published: June 9, 2025 | arXiv ID: 2506.07527v1

By: Lu Ma, Hao Liang, Meiyi Qiang and more

Potential Business Impact:

Teaches computers new things to solve harder problems.

Business Areas:

Machine Learning, Artificial Intelligence, Data and Analytics, Software

Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it primarily optimizes over the model's existing knowledge rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.
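
The interleaving idea in the abstract can be illustrated with a small toy simulation. This is a sketch under stated assumptions, not the authors' implementation. The "model" is just a table of per-question success probabilities, and a question counts as "challenging" when every rollout fails. Its demonstration is then buffered, and a fine-tuning step runs once the buffer is full. All names here (ToyModel, rl_step, sft_step, train, buffer_size) and both update rules are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    # Hypothetical stand-in for an LLM: probability of solving each question.
    skill: dict = field(default_factory=dict)

    def solve(self, question, rng):
        return rng.random() < self.skill.get(question, 0.0)


def rl_step(model, question, successes, n_rollouts, lr=0.2):
    # Toy "RL" update (assumption): the gain scales with the observed success
    # rate, so a question that is never solved yields no learning signal.
    rate = successes / n_rollouts
    p = model.skill.get(question, 0.0)
    model.skill[question] = p + lr * rate * (1.0 - p)


def sft_step(model, buffer, lr=0.5):
    # Toy "SFT" update (assumption): a demonstration raises skill even when
    # the model's own rollouts never succeed.
    for question in buffer:
        p = model.skill.get(question, 0.0)
        model.skill[question] = p + lr * (1.0 - p)


def train(model, questions, steps=20, n_rollouts=4, buffer_size=2, seed=0):
    rng = random.Random(seed)
    buffer = []          # hard questions awaiting a fine-tuning step
    sft_updates = 0
    for _ in range(steps):
        for question in questions:
            successes = sum(model.solve(question, rng) for _ in range(n_rollouts))
            if successes == 0:
                # "Challenging" here means every rollout failed (assumption).
                if question not in buffer:
                    buffer.append(question)
            else:
                rl_step(model, question, successes, n_rollouts)
        if len(buffer) >= buffer_size:
            # Interleave: switch to fine-tuning on the collected hard cases.
            sft_step(model, buffer)
            buffer.clear()
            sft_updates += 1
    return sft_updates
```

In this toy setting, a question with initial skill 0.0 never improves when the buffer threshold is set too high to be reached (RL only), but it does improve once fine-tuning steps are interleaved. This mirrors, in a very simplified form, the complementary roles the abstract attributes to RL and SFT.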

On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training

Machine Learning (CS)

Mixing AI training methods hurts its smartness.

12 Jan 2026 1

91%

Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners

Computation and Language

Makes AI smarter by learning from mistakes.

6 Oct 2025 1

91%

Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning

Machine Learning (CS)

Makes AI better at thinking, even small ones.

14 Dec 2025 0

Repos / Data Links

github.com

Page Count

12 pages