Doubly Robust Alignment for Large Language Models
By: Erhan Xu, Kai Ye, Hongyi Zhou and more
Potential Business Impact:
Makes AI understand what people want better.
This paper studies reinforcement learning from human feedback (RLHF) for aligning large language models with human preferences. While RLHF has demonstrated promising results, many algorithms are highly sensitive to misspecifications in the underlying preference model (e.g., the Bradley-Terry model), the reference policy, or the reward function, resulting in undesirable fine-tuning. To address model misspecification, we propose a doubly robust preference optimization algorithm that remains consistent when either the preference model or the reference policy is correctly specified (without requiring both). Our proposal demonstrates superior and more robust performance than state-of-the-art algorithms, both in theory and in practice. The code is available at https://github.com/DRPO4LLM/DRPO4LLM
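The "consistent when either model is correct" property in the abstract can be illustrated with a generic doubly robust (AIPW-style) estimator on a toy discrete problem. This is a minimal sketch of the general idea, not the paper's algorithm. The reward model here loosely stands in for the preference model, and the data-generating policy for the reference policy. The function name `expected_dr_estimate` and every number below are hypothetical. To keep the result deterministic, the code computes the estimator's expected value rather than drawing samples.

```python
# Toy illustration of double robustness with a generic AIPW-style estimator.
# All names and numbers are hypothetical; this is not the paper's algorithm.

def expected_dr_estimate(pi, mu_true, mu_hat, r_true, r_hat):
    """Expected value of a doubly robust estimate of the target policy's value.

    pi      : target policy probabilities over actions
    mu_true : policy that actually generated the data
    mu_hat  : our (possibly wrong) model of that policy
    r_true  : true mean reward per action
    r_hat   : our (possibly wrong) reward model
    """
    # Direct-method term: plug the reward model into the target policy.
    direct = sum(p * rh for p, rh in zip(pi, r_hat))
    # Importance-weighted correction of the reward model's residual,
    # averaged over actions drawn from the true data-generating policy.
    correction = sum(
        m * (p / mh) * (r - rh)
        for m, p, mh, r, rh in zip(mu_true, pi, mu_hat, r_true, r_hat)
    )
    return direct + correction


pi = [0.7, 0.2, 0.1]          # target policy
mu_true = [0.2, 0.3, 0.5]     # true data-generating policy
r_true = [1.0, 0.5, 0.0]      # true mean rewards
mu_wrong = [1 / 3, 1 / 3, 1 / 3]  # misspecified policy model
r_wrong = [0.2, 0.2, 0.2]         # misspecified reward model

# Ground truth: value of the target policy under the true rewards.
true_value = sum(p * r for p, r in zip(pi, r_true))

# Policy model correct, reward model wrong.
only_policy_right = expected_dr_estimate(pi, mu_true, mu_true, r_true, r_wrong)
# Reward model correct, policy model wrong.
only_reward_right = expected_dr_estimate(pi, mu_true, mu_wrong, r_true, r_true)
# Both models wrong.
both_wrong = expected_dr_estimate(pi, mu_true, mu_wrong, r_true, r_wrong)

print(true_value, only_policy_right, only_reward_right, both_wrong)
```

In this toy setup the estimate matches the true value whenever at least one of the two models is correct, and it is biased only when both are misspecified.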
Similar Papers
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Machine Learning (Stat)
Makes AI understand what people want better.
Distributionally Robust Reinforcement Learning with Human Feedback
Machine Learning (CS)
Makes AI smarter even with new, different questions.
From Demonstrations to Rewards: Alignment Without Explicit Human Preferences
Machine Learning (CS)
Teaches computers to follow instructions better.