On a Connection Between Imitation Learning and RLHF
By: Teng Xiao, Yige Yuan, Mingxiao Li and more
Potential Business Impact:
Teaches computers to follow human instructions better.
This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback (RLHF) and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks.
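One well-known way to see why RLHF and distribution matching are related can be checked numerically on a toy problem. The sketch below is not the paper's DIL objective or derivation. It assumes the standard KL-regularized RLHF objective, J(pi) = E_pi[r] - beta * KL(pi || pi_ref), and verifies the identity J(pi) = beta * log Z - beta * KL(pi || pi*), where pi*(y) is proportional to pi_ref(y) * exp(r(y) / beta). Under that assumption, maximizing the RLHF objective is the same as matching the target distribution pi*. All function names and numbers here are hypothetical and chosen only for illustration.

```python
import math

def tilted_policy(pi_ref, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), and log Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], math.log(z)

def kl(p, q):
    """KL(p || q) for discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(pi, pi_ref, rewards, beta):
    """Expected reward minus a beta-weighted KL penalty to the reference policy."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl(pi, pi_ref)

# Toy setup (assumed values): three candidate responses to a single prompt.
pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, 2.0]
beta = 0.5

pi_star, log_z = tilted_policy(pi_ref, rewards, beta)

# An arbitrary policy to evaluate both sides of the identity on.
candidate = [0.2, 0.2, 0.6]
lhs = rlhf_objective(candidate, pi_ref, rewards, beta)
rhs = beta * log_z - beta * kl(candidate, pi_star)

# The objective evaluated at pi* itself, where the KL term vanishes.
best = rlhf_objective(pi_star, pi_ref, rewards, beta)
```

Since the KL term is non-negative and zero only at pi*, `best` equals `beta * log_z` and upper-bounds `lhs` for any other policy in this toy setting.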
Similar Papers
Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
Machine Learning (CS)
Teaches AI to be safer by learning from mistakes.
Doubly Robust Alignment for Large Language Models
Machine Learning (CS)
Makes AI understand what people want better.
Aligning to What? Limits to RLHF Based Alignment
Computation and Language
Fixes AI bias, but not perfectly yet.