Trust Region Masking for Long-Horizon LLM Reinforcement Learning
By: Yingru Li, Jiacai Liu, Jiawei Xu, et al.
Policy gradient methods for large language models optimize a surrogate objective computed from samples of a rollout policy $\pi_{\text{roll}}$. When $\pi_{\text{roll}} \ne \pi_\theta$, there is approximation error between the surrogate and the true objective. Prior work has shown that this off-policy mismatch is unavoidable in modern LLM-RL due to implementation divergence, mixture-of-experts routing discontinuities, and distributed training staleness. Classical trust region bounds on the resulting error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. We derive two tighter bounds: a Pinsker-Marginal bound scaling as $O(T^{3/2})$ and a Mixed bound scaling as $O(T)$. Crucially, both bounds depend on $D_{\mathrm{KL}}^{\mathrm{tok},\max}$, the maximum token-level KL divergence across all positions in a sequence. This is inherently a sequence-level quantity: it requires examining the entire trajectory to compute, and therefore cannot be controlled by token-independent methods like PPO clipping. We propose Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.
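The abstract does not give pseudocode, but a minimal sketch of the masking step, as described above, might look like the following. The function names `trust_region_mask` and `masked_pg_loss`, the threshold `delta`, and the use of a single-sample log-ratio as the token-level KL estimate are illustrative assumptions, not the authors' implementation.

```python
import torch

def trust_region_mask(logp_roll, logp_theta, attention_mask, delta):
    """Sketch of Trust Region Masking (TRM): drop whole sequences whose
    worst-case token-level KL divergence exceeds the trust region delta.

    Args:
        logp_roll:  (B, T) log-probs of the sampled tokens under pi_roll.
        logp_theta: (B, T) log-probs of the same tokens under pi_theta.
        attention_mask: (B, T) 1 for real tokens, 0 for padding.
        delta: trust-region radius on the token-level KL (hypothetical value).

    Returns:
        (B,) float mask: 1.0 keeps the sequence, 0.0 excludes it.
    """
    # Crude single-sample estimate of the per-token KL(pi_roll || pi_theta);
    # with full next-token distributions available one would use the exact KL.
    token_kl = (logp_roll - logp_theta) * attention_mask

    # D_KL^{tok,max}: maximum token-level divergence over the sequence.
    # This is a sequence-level statistic, so it cannot be enforced by
    # clipping each token's importance ratio independently (as PPO does).
    max_token_kl = token_kl.masked_fill(attention_mask == 0, float("-inf")).amax(dim=-1)

    return (max_token_kl <= delta).float()


def masked_pg_loss(advantages, logp_theta, logp_roll, attention_mask, delta=0.05):
    """Importance-weighted policy-gradient surrogate with TRM-style sequence masking."""
    seq_mask = trust_region_mask(logp_roll, logp_theta, attention_mask, delta)
    ratio = torch.exp(logp_theta - logp_roll)           # per-token importance ratio
    per_token = -(ratio * advantages) * attention_mask  # REINFORCE-style surrogate
    per_seq = per_token.sum(dim=-1)
    # Excluded sequences contribute zero gradient; normalize by the kept count.
    return (per_seq * seq_mask).sum() / seq_mask.sum().clamp(min=1.0)
```

The key design choice reflected in this sketch is that a single violating token removes the entire sequence from the gradient, rather than being clipped in isolation, which is what ties the practical update to the sequence-level quantity $D_{\mathrm{KL}}^{\mathrm{tok},\max}$ appearing in the bounds.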