KL-Regularized Reinforcement Learning is Designed to Mode Collapse
By: Anthony GX-Chen, Jatin Prakash, Jeff Guo, and more
Potential Business Impact:
Helps AI find many good answers, not just one.
It is commonly believed that optimizing the reverse KL divergence is "mode seeking" while optimizing the forward KL is "mass covering", and that the latter is preferable when the goal is to sample from multiple diverse modes. We show, both mathematically and empirically, that this intuition does not transfer cleanly to reinforcement learning with reverse/forward KL regularization (e.g., as commonly used with language models). Instead, the choice of reverse versus forward KL determines the family of optimal target distributions, parameterized by the regularization coefficient. Mode coverage depends primarily on other factors, such as the regularization strength and the relative scale of rewards to reference probabilities. Further, we show that commonly used settings, such as low regularization strength and equal verifiable rewards, tend to specify unimodal target distributions, meaning the optimization objective is non-diverse by construction. We leverage these insights to construct a simple, scalable, and theoretically justified algorithm. It makes minimal changes to reward magnitudes, yet optimizes for a target distribution that places high probability on all high-quality sampling modes. In experiments, this simple modification post-trains both Large Language Models and Chemical Language Models to produce solutions of higher quality and diversity, without any external diversity signal, and it works with both forward and reverse KL regularization in settings where using either naively fails.
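To make the abstract's argument concrete, here is a minimal toy sketch (not the paper's algorithm) based on the standard closed form of the reverse-KL-regularized objective, whose optimum is pi*(y) proportional to pi_ref(y) * exp(r(y)/beta). The three-answer setup, reference probabilities, and reward values are invented for illustration; it only shows why equal verifiable rewards can leave the target dominated by a single mode, and how rescaling rewards relative to the reference can restore coverage.

```python
import numpy as np

# Toy illustration of the closed-form optimum of reverse-KL-regularized RL:
#   max_pi  E_pi[r(y)] - beta * KL(pi || pi_ref)
# whose solution is pi*(y) ∝ pi_ref(y) * exp(r(y) / beta).
# The reference probabilities and rewards below are hypothetical, not from the paper.

def target_distribution(pi_ref, reward, beta):
    """Optimal KL-regularized target: pi*(y) ∝ pi_ref(y) * exp(r(y)/beta)."""
    logits = np.log(pi_ref) + reward / beta
    logits -= logits.max()            # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Three hypothetical answers: two distinct correct modes (A, B) and one incorrect (C).
pi_ref = np.array([0.70, 0.05, 0.25])  # reference model strongly prefers mode A
reward = np.array([1.0, 1.0, 0.0])     # equal verifiable reward for both correct modes

for beta in [1.0, 0.1, 0.01]:
    pi_star = target_distribution(pi_ref, reward, beta)
    print(f"beta={beta:<5} pi*(A, B, C) = {np.round(pi_star, 3)}")

# With equal rewards, the ratio pi*(A) / pi*(B) equals pi_ref(A) / pi_ref(B) = 14
# at every beta, so the target stays dominated by mode A even as the incorrect
# answer C is suppressed. Offsetting the reference imbalance in the reward,
# e.g. r(B) += beta * log(14), equalizes the two correct modes, illustrating how
# small changes to reward magnitudes can reshape the target toward multimodality.
```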
Similar Papers
The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward
Machine Learning (CS)
Keeps AI smart and prevents it from forgetting.
Achieving Logarithmic Regret in KL-Regularized Zero-Sum Markov Games
Machine Learning (CS)
Teaches computers to learn faster with less data.
Data-regularized Reinforcement Learning for Diffusion Models at Scale
Machine Learning (CS)
Makes AI create better videos that people like.