On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training
By: Xueyan Niu, Bo Bai, Wei Han, et al.
Potential Business Impact:
Alternating SFT and RL during post-training can erode the capabilities gained in the earlier stage, complicating pipelines that rely on both.
Post-training of large language models routinely interleaves supervised fine-tuning (SFT) with reinforcement learning (RL). The two methods have different objectives: SFT minimizes the cross-entropy loss between model outputs and expert responses, while RL maximizes reward signals derived from human preferences or rule-based verifiers. Modern reasoning models have widely adopted the practice of alternating SFT and RL training, yet there is no theoretical account of whether the two stages can be decoupled. We prove that decoupling is impossible in either order: (1) SFT-then-RL coupling: RL increases the SFT loss under SFT optimality, and (2) RL-then-SFT coupling: SFT lowers the reward achieved by RL. Experiments on Qwen3-0.6B confirm the predicted degradation, verifying that SFT and RL cannot be separated without loss of prior performance during post-training.
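For reference, the two objectives contrasted in the abstract can be written compactly. This is a minimal sketch under standard notation, not the paper's exact formulation: π_θ denotes the model policy, D the training distribution of prompts x with expert responses y*, and r a reward function from preferences or rule-based verifiers.

\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\,y^{*}) \sim \mathcal{D}}\big[\log \pi_{\theta}(y^{*} \mid x)\big] \quad \text{(cross-entropy, minimized by SFT)}

\mathcal{J}_{\text{RL}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_{\theta}(\cdot \mid x)}\big[r(x, y)\big] \quad \text{(expected reward, maximized by RL)}

In these terms, the paper's coupling results say that starting from a minimizer of L_SFT, subsequent RL updates increase L_SFT, and conversely, SFT applied after RL decreases J_RL.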
Similar Papers
Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners
Computation and Language
Reducing forgetting between the SFT and RL stages produces stronger reasoning models.
Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
Artificial Intelligence
Interleaves online fine-tuning with RL to tackle the hardest questions RL alone cannot solve.
RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
Machine Learning (CS)
Examines when RL fine-tuning helps LLMs and when supervised fine-tuning is the better choice.