KL-Regularised Q-Learning: A Token-level Action-Value perspective on Online RLHF

Published: August 23, 2025 | arXiv ID: 2508.17000v1

By: Jason R Brown, Lennie Wells, Edward James Young, and more

Potential Business Impact:

Improves how language models are fine-tuned from human feedback, yielding higher-quality summaries and dialogue responses.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Proximal Policy Optimisation (PPO) is an established and effective policy-gradient algorithm used for Language Model Reinforcement Learning from Human Feedback (LM-RLHF). PPO performs well empirically but has a heuristic motivation and handles the KL-divergence constraint used in LM-RLHF in an ad-hoc manner. In this paper, we develop a new action-value RL method for the LM-RLHF setting, KL-regularised Q-Learning (KLQ). We then show that our method is equivalent to a version of PPO in a specific sense, despite its very different motivation. Finally, we benchmark KLQ on two key language generation tasks -- summarisation and single-turn dialogue. We demonstrate that KLQ performs on par with PPO at optimising the LM-RLHF objective, and achieves a consistently higher win-rate against PPO on LLM-as-a-judge evaluations.
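For context, the LM-RLHF objective referred to in the abstract is typically the standard KL-regularised reward maximisation below; the notation here (reward $r$, reference policy $\pi_{\mathrm{ref}}$, KL coefficient $\beta$) is the conventional formulation and not necessarily the paper's exact one:

\[
J(\pi) \;=\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big).
\]

In standard KL-regularised RL, the optimal policy at the token level satisfies
\[
\pi^{*}(a \mid s) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big( Q^{*}(s, a) / \beta \big),
\]
where $Q^{*}$ is the optimal soft action-value function. The title's token-level action-value perspective suggests a relation of this general form, though the paper's precise derivation may differ.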

Country of Origin
🇬🇧 United Kingdom

Page Count
26 pages

Category
Computer Science:
Computation and Language