Rethinking Expert Trajectory Utilization in LLM Post-training
By: Bowen Ding, Yuhan Chen, Jiayang Lv, and more
Potential Business Impact:
Teaches AI models more effectively by showing them expert examples before they practice on their own.
While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
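A minimal sketch of the decomposition described in the abstract, using shorthand notation of our own rather than the paper's:

$$ C_{\text{final}} \;\approx\; \underbrace{P_{\text{SFT}}}_{\text{foundational SFT performance}} \;+\; \underbrace{\Delta_{\text{RL}}}_{\text{RL plasticity}} $$

Read this way, the scaling guidelines amount to choosing the SFT-to-RL handover point, the data scale, and the trajectory difficulty so that $P_{\text{SFT}}$ is secured without shrinking $\Delta_{\text{RL}}$.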
Similar Papers
RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
Machine Learning (CS)
Fixes AI mistakes after learning new things.
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Computation and Language
Teaches computers to solve hard problems step-by-step.
Mitigating Forgetting Between Supervised and Reinforcement Learning Yields Stronger Reasoners
Computation and Language
Makes AI smarter by learning from mistakes.