KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF
By: Jason R Brown, Lennie Wells, Edward James Young and more
Potential Business Impact:
Makes AI write better stories and talk smarter.
Proximal Policy Optimisation (PPO) is an established and effective policy gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a certain specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks: summarisation and single-turn dialogue. We demonstrate that KLQ performs on par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
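For readers unfamiliar with the KL-regularised setting the abstract refers to, the following is a minimal sketch of the standard KL-regularised objective at a single token position. It is not the paper's KLQ algorithm, which the abstract does not specify. The names `q` (action values), `pi_ref` (reference-policy probabilities, assumed strictly positive) and `beta` (KL coefficient) are hypothetical and chosen for illustration. Under these assumptions, the policy maximising E_pi[q] - beta * KL(pi || pi_ref) is the reference policy reweighted by exp(q / beta), and the optimal value is beta * log sum_a pi_ref(a) * exp(q(a) / beta).

```python
import math


def kl_regularised_policy(q, pi_ref, beta):
    """Return (policy, value) maximising E_pi[q] - beta * KL(pi || pi_ref).

    Illustrative only: q, pi_ref and beta are hypothetical inputs, and
    pi_ref is assumed to have strictly positive entries.
    """
    # Log-space logits of the unnormalised optimum pi_ref(a) * exp(q(a) / beta).
    logits = [math.log(p) + qa / beta for qa, p in zip(q, pi_ref)]
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    policy = [math.exp(l - log_z) for l in logits]
    # The optimal value of the regularised objective is beta * log Z.
    value = beta * log_z
    return policy, value


def objective(pi, q, pi_ref, beta):
    """Evaluate E_pi[q] - beta * KL(pi || pi_ref) for an arbitrary policy pi."""
    expected_q = sum(p * qa for p, qa in zip(pi, q))
    kl = sum(p * math.log(p / r) for p, r in zip(pi, pi_ref) if p > 0)
    return expected_q - beta * kl
```

Evaluating `objective` at the returned policy reproduces the returned value, while any other distribution (for example `pi_ref` itself) scores no higher. A smaller `beta` lets the policy move further from the reference towards high-value tokens.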