Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections
By: Bo Wang, Qinyuan Cheng, Runyu Peng, and more
Potential Business Impact:
Makes AI better at following instructions.
Post-training processes are essential phases in grounding pre-trained language models to real-world tasks, with learning from demonstrations or preference signals playing a crucial role in this adaptation. We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. Through rigorous mathematical derivation, we demonstrate that both SFT and preference learning methods like Direct Preference Optimization (DPO) operate within the same optimal policy-reward subspace, with SFT representing a special case of implicit reward learning. Our analysis reveals a critical limitation in conventional SFT: the KL divergence term in distribution matching becomes constant with respect to the policy during optimization, failing to constrain model updates. To address this, we propose a simple yet effective learning rate reduction approach that yields significant performance improvements (up to 25% relative gain and 6% absolute win-rate increase in instruction-following tasks). Additionally, we derive alternative SFT objectives from various f-divergence functions that preserve the KL term during optimization, further enhancing post-DPO model performance. Finally, we extend the theoretical relationship between LLM logits and Q-functions from preference learning to the SFT context, providing mathematical derivations and experimental validation.
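For orientation, here is a minimal sketch of the objects the abstract presumes, written in the standard DPO notation (policy π_θ, reference π_ref, temperature β, partition function Z(x)); the precise derivation of SFT as a special case of implicit reward learning is the paper's own contribution and is not reproduced here.

% KL-regularized objective with policy \pi_\theta, reference \pi_{\mathrm{ref}}, and temperature \beta:
\max_{\pi_\theta}\ \mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big]

% Closed-form optimum and the implicit reward obtained by inverting it (the standard DPO identity):
\pi^*(y\mid x) \;=\; \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y\mid x)\exp\!\big(r(x,y)/\beta\big),
\qquad r(x,y) \;=\; \beta\log\frac{\pi^*(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} \;+\; \beta\log Z(x)

% SFT minimizes negative log-likelihood on demonstrations y^+:
\mathcal{L}_{\mathrm{SFT}}(\theta) \;=\; -\,\mathbb{E}_{(x,y^+)\sim\mathcal{D}}\big[\log\pi_\theta(y^+\mid x)\big]

The paper's unified view reads the SFT objective through this implicit-reward lens, arguing that SFT operates in the same optimal policy-reward subspace as DPO and that the KL term in the corresponding distribution-matching objective is constant with respect to the policy, which is what motivates both the learning-rate reduction and the alternative f-divergence objectives described above.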
Similar Papers
Assessing Robustness to Spurious Correlations in Post-Training Language Models
Computation and Language
Teaches AI to ignore bad information.
Explicit Preference Optimization: No Need for an Implicit Reward Model
Machine Learning (CS)
Makes AI learn better without extra steps.
Learning to Align Human Code Preferences
Software Engineering
Teaches computers to write better code.