RePO: ReLU-based Preference Optimization
By: Junkang Wu, Kexin Huang, Xue Wang, and more
Potential Business Impact:
Makes AI better at understanding what you want.
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter $\beta$, subsequent methods like SimPO reintroduce complexity through dual parameters ($\beta$, $\gamma$). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances: (1) retaining SimPO's reference-free margins while removing $\beta$ through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
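To make the objective concrete, below is a minimal PyTorch-style sketch of a ReLU-based max-margin loss over SimPO-style reference-free, length-normalized reward margins, as the abstract describes. The function name, argument names, and the target margin `gamma` are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def repo_loss(policy_chosen_logps: torch.Tensor,
              policy_rejected_logps: torch.Tensor,
              chosen_lengths: torch.Tensor,
              rejected_lengths: torch.Tensor,
              gamma: float = 1.0) -> torch.Tensor:
    """Sketch of a ReLU-based max-margin preference loss (single hyperparameter gamma)."""
    # Length-normalized, reference-free "rewards", in the spirit of SimPO.
    chosen_rewards = policy_chosen_logps / chosen_lengths
    rejected_rewards = policy_rejected_logps / rejected_lengths
    margins = chosen_rewards - rejected_rewards
    # ReLU gate: pairs already separated by at least gamma contribute zero gradient
    # (binary thresholding, the beta -> infinity limit of SimPO's logistic weighting);
    # the remaining pairs receive a constant-magnitude push.
    return F.relu(gamma - margins).mean()
```

With this formulation, trivial pairs (margin already above `gamma`) are filtered out automatically, which is why only the single target margin needs tuning.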
Similar Papers
Robust Preference Optimization via Dynamic Target Margins
Computation and Language
Makes AI smarter and safer by fixing bad training data.
A Survey of Direct Preference Optimization
Machine Learning (CS)
Teaches computers to be helpful and safe.
BPO: Revisiting Preference Modeling in Direct Preference Optimization
Computation and Language
Makes AI better at math and following instructions.