RSPO: Regularized Self-Play Alignment of Large Language Models
By: Xiaohang Tang, Sangwoong Yoon, Seongho Son, and more
Potential Business Impact:
Makes AI assistants more helpful and less wordy.
Self-play alignment has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. To study the impact of different regularization strategies, we propose Regularized Self-Play Policy Optimization (RSPO), a general and modular framework that unifies prior methods and enables simple plug-and-play integration of various regularizers, while preserving convergence to the Nash equilibrium of the corresponding regularized game. Our empirical study of over 120 fine-tuned Mistral-7B-Instruct models reveals that forward KL divergence regularization reduces response length, whereas reverse KL divergence markedly improves raw win rates. Crucially, RSPO regularized with a linear combination of forward and reverse KL divergence significantly boosts the length-controlled win rate on AlpacaEval-2 from 28.5% (unregularized self-play, SPPO) to 35.4%, and consistently demonstrates superior performance on Arena-Hard, MT-Bench, ArmoRM scores, and response diversity. Combining simplicity, convergence guarantees, and significant empirical gains, RSPO offers a strong foundation for exploring regularized self-play in language model alignment.
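As a rough illustration of the plug-and-play regularization the abstract describes, the sketch below computes a linear combination of forward and reverse KL divergence between the current policy and a frozen reference model, the kind of penalty term that would be added to a self-play objective. The function name, the weights `alpha` and `beta`, and the tensor shapes are assumptions made for illustration; this is not RSPO's actual implementation.

```python
import torch
import torch.nn.functional as F

def combined_kl_regularizer(policy_logits: torch.Tensor,
                            ref_logits: torch.Tensor,
                            alpha: float = 0.5,
                            beta: float = 0.5) -> torch.Tensor:
    """Per-token regularizer mixing reverse and forward KL to a reference policy.

    policy_logits, ref_logits: [batch, seq_len, vocab] logits from the current
    policy and a frozen reference model. alpha and beta weight the reverse and
    forward KL terms (illustrative names, not the paper's notation).
    Returns a [batch, seq_len] tensor of per-token penalties.
    """
    log_p = F.log_softmax(policy_logits, dim=-1)   # log pi(token | context)
    log_q = F.log_softmax(ref_logits, dim=-1)      # log pi_ref(token | context)
    p, q = log_p.exp(), log_q.exp()

    reverse_kl = (p * (log_p - log_q)).sum(dim=-1)  # KL(pi || pi_ref)
    forward_kl = (q * (log_q - log_p)).sum(dim=-1)  # KL(pi_ref || pi)
    return alpha * reverse_kl + beta * forward_kl
```

In a training loop, this penalty would be averaged over response tokens and added to the self-play preference loss, so that setting `beta = 0` recovers a purely reverse-KL-regularized objective and `alpha = 0` a purely forward-KL one.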
Similar Papers
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
CV and Pattern Recognition
Teaches AI to learn from its video mistakes.
SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models
Robotics
Teaches robots to learn from their own mistakes.
SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM
Machine Learning (CS)
Makes AI smarter at math and coding faster.