Score: 2

Rethinking Expert Trajectory Utilization in LLM Post-training

Published: December 12, 2025 | arXiv ID: 2512.11470v1

By: Bowen Ding, Yuhan Chen, Jiayang Lv and more

BigTech Affiliations: Huawei

Potential Business Impact:

Teaches AI more effectively by showing it expert examples first, before letting it refine itself through practice.

Business Areas:
Machine Learning, Artificial Intelligence, Data and Analytics, Software

While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) Refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) Identifying that the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
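
As a reading aid, here is a minimal Python sketch (not taken from the paper) of how guideline (1) could be operationalized: track the SFT validation-loss curve and hand off to RL near its minimum, i.e. in the Stable or Mild Overfitting sub-phase. The function name, patience, and tolerance parameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: pick the SFT checkpoint at which to switch to RL, based on
# the validation-loss curve. Heuristic thresholds below are assumptions.

def find_sft_to_rl_step(val_losses, patience=3, overfit_tolerance=0.02):
    """Return the checkpoint index at which to hand off from SFT to RL.

    val_losses: SFT validation loss per checkpoint.
    patience: consecutive non-improving checkpoints before the minimum is
        considered reached (Stable sub-phase).
    overfit_tolerance: relative rise above the minimum loss still treated as
        Mild Overfitting; beyond this, RL plasticity is assumed to degrade.
    """
    best_loss = float("inf")
    best_step = 0
    stale = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step, stale = loss, step, 0
        else:
            stale += 1
            # Plateau reached, or loss drifted past the mild-overfitting band.
            if stale >= patience or loss > best_loss * (1 + overfit_tolerance):
                return step
    return best_step  # curve still improving: use the latest best checkpoint


if __name__ == "__main__":
    # Hypothetical validation-loss curve: improving, then flat, then rising.
    losses = [1.90, 1.62, 1.48, 1.41, 1.40, 1.40, 1.41, 1.45, 1.52]
    print("Switch to RL at checkpoint:", find_sft_to_rl_step(losses))
```

The same minimum-validation-loss signal is what the abstract proposes for guideline (3), selecting which expert trajectories to train on; this sketch only illustrates the transition-timing use.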

Country of Origin
🇨🇳 China

Repos / Data Links

Page Count
24 pages

Category
Computer Science:
Machine Learning (CS)