Score: 1

RSPO: Regularized Self-Play Alignment of Large Language Models

Published: February 24, 2025 | arXiv ID: 2503.00030v2

By: Xiaohang Tang, Sangwoong Yoon, Seongho Son, and more

Potential Business Impact:

Makes AI assistants more helpful and less wordy.

Business Areas:
A/B Testing, Data and Analytics

Self-play alignment has emerged as an effective approach for fine-tuning large language models (LLMs), formulating preference optimization as a two-player game. However, regularization with respect to the reference policy, which is crucial for mitigating over-optimization, has been insufficiently investigated in self-play alignment. To study the impact of different regularization strategies, we propose Regularized Self-Play Policy Optimization (RSPO), a general and modular framework that unifies prior methods and enables simple plug-and-play integration of various regularizers while preserving convergence to the Nash equilibrium of the corresponding regularized game. Our empirical study involving over $120$ fine-tuned Mistral-7B-Instruct models reveals that forward KL divergence regularization reduces response length, whereas reverse KL divergence markedly improves raw win rates. Crucially, RSPO regularized with a linear combination of forward and reverse KL divergence significantly boosts the length-controlled win rate on AlpacaEval-2 from $28.5\%$ (unregularized self-play, SPPO) to $35.4\%$, and consistently demonstrates superior performance on Arena-Hard, MT-Bench, ArmoRM scores, and response diversity. Combining simplicity, convergence guarantees, and significant empirical gains, RSPO offers a strong foundation for exploring regularized self-play in language model alignment.
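
The abstract's key ingredient is a regularizer built as a linear combination of forward and reverse KL divergence against the reference policy. The snippet below is a minimal, hypothetical sketch of that combination at the level of next-token distributions; the function names, the `alpha`/`beta` weights, and the token-level form are illustrative assumptions, not the paper's actual objective or code.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def mixed_kl_regularizer(policy_probs, ref_probs, alpha=0.5, beta=0.5):
    """Linear combination of forward KL(ref || policy) and reverse KL(policy || ref).

    The (alpha, beta) weights and this per-token form are illustrative assumptions.
    """
    forward_kl = kl(ref_probs, policy_probs)   # the abstract links forward KL to shorter responses
    reverse_kl = kl(policy_probs, ref_probs)   # the abstract links reverse KL to higher raw win rates
    return alpha * forward_kl + beta * reverse_kl

# Toy usage: next-token distributions over a 4-symbol vocabulary.
policy = np.array([0.55, 0.25, 0.15, 0.05])
reference = np.array([0.40, 0.30, 0.20, 0.10])
print(mixed_kl_regularizer(policy, reference, alpha=0.5, beta=0.5))
```

In a self-play setup, a penalty of this form would be added to each player's preference-optimization loss so that the game's Nash equilibrium stays anchored to the reference policy.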

Country of Origin
🇺🇸 🇬🇧 United States, United Kingdom

Page Count
21 pages

Category
Computer Science:
Machine Learning (CS)