Bootstrapping LLMs via Preference-Based Policy Optimization
By: Chen Jia
Potential Business Impact:
Teaches AI to follow human preferences more reliably, with far less manual labeling.
Bootstrapping large language models (LLMs) through preference-based policy optimization offers a promising direction for aligning model behavior with human preferences without relying on extensive manual annotation. In this work, we propose a novel preference-based policy optimization (PbPO) framework that formulates the learning process as a min-max game between the main policy and a reward model (RM). The RM is constrained within a confidence set derived from preference data to ensure reliable exploitation. Our iterative online algorithm actively collects preference data through guided exploration with the evolving policy, enabling continual self-improvement of both the policy and the RM. We provide theoretical guarantees for our method, establishing high-probability regret bounds for both the sequence-level RM and token-level RM settings, and demonstrating its effectiveness for bootstrapping LLMs. Extensive experiments on five benchmarks show that our approach consistently outperforms state-of-the-art preference optimization techniques.
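To make the min-max idea concrete, below is a minimal toy sketch of such a loop: a policy and a reward-model confidence set are updated in alternation from online preference data. Everything here is an illustrative assumption rather than the paper's method: the finite "response" set, the Bradley-Terry preference simulator, the finite family of candidate reward models, the fixed confidence radius, and the softmax best-response update all stand in for the paper's LLM policies, sequence- or token-level RMs, and its actual confidence-set and optimization procedures.

```python
# Toy sketch of a preference-based min-max loop (illustrative only; not the
# paper's algorithm). A policy over a finite response set plays against the
# worst-case reward model inside a confidence set fit to preference data.
import numpy as np

rng = np.random.default_rng(0)

responses = np.arange(6)                                   # stand-in for candidate responses
true_reward = np.array([0.1, 0.4, 0.9, 0.3, 0.7, 0.2])     # hidden "annotator" preferences

# A small random family of candidate reward models; the confidence set is the
# subset whose preference log-likelihood stays close to the best candidate.
candidate_rms = rng.normal(size=(50, len(responses)))

def bt_loglik(rm, pairs):
    """Bradley-Terry log-likelihood of observed (winner, loser) preferences."""
    return sum(-np.log1p(np.exp(-(rm[w] - rm[l]))) for w, l in pairs)

def softmax(x, tau=1.0):
    z = np.exp((x - x.max()) / tau)
    return z / z.sum()

pairs = []                                                  # online preference dataset
policy = np.full(len(responses), 1.0 / len(responses))      # start uniform

for it in range(10):
    # Guided exploration: sample a pair of responses from the current policy
    # and query the simulated annotator for a preference.
    a, b = rng.choice(responses, size=2, replace=False, p=policy)
    p_a_wins = 1.0 / (1.0 + np.exp(-(true_reward[a] - true_reward[b])))
    w, l = (a, b) if rng.random() < p_a_wins else (b, a)
    pairs.append((w, l))

    # Confidence set: reward models whose log-likelihood is within an
    # (arbitrarily chosen) radius of the best candidate seen so far.
    logliks = np.array([bt_loglik(rm, pairs) for rm in candidate_rms])
    conf_set = candidate_rms[logliks >= logliks.max() - 1.0]

    # Min-max step: pick the adversarial (worst-case) reward model for the
    # current policy, then let the policy best-respond to it.
    worst_case = conf_set[np.argmin(conf_set @ policy)]
    policy = softmax(worst_case, tau=0.5)

print("final policy over responses:", np.round(policy, 3))
```

In this sketch the inner minimization is approximated by selecting the confidence-set member with the lowest expected reward under the current policy, and the outer maximization by a softmax best response; an actual LLM instantiation would instead use gradient-based policy optimization and a learned sequence- or token-level RM, as described in the paper.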
Similar Papers
Offline Preference Optimization via Maximum Marginal Likelihood Estimation
Machine Learning (CS)
Makes AI understand what you like better.
IPO: Your Language Model is Secretly a Preference Classifier
Computation and Language
Makes AI learn what people like without asking.
Robust LLM Alignment via Distributionally Robust Direct Preference Optimization
Machine Learning (CS)
Makes AI understand what people want better.