On the Hidden Objective Biases of Group-based Reinforcement Learning
By: Aleksandar Fontana, Marco Simoni, Giulio Rossolini, and more
Potential Business Impact:
Reveals hidden biases in how AI models are trained, guiding more accurate training methods.
Group-based reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), are now widely used to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO-style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.
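To make the abstract's terminology concrete, below is a minimal, hypothetical Python sketch (not taken from the paper) of a GRPO-style group-relative advantage. It also illustrates one simple facet of finding (ii): because rewards are standardized within each group, positively rescaling them leaves the resulting advantages essentially unchanged; the paper's full argument additionally involves the AdamW optimizer, which this sketch does not model. The names rewards, group_relative_advantages, and eps are illustrative assumptions.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Standardize rewards across the G responses sampled for one prompt,
    # as in generic GRPO-style group-relative baselines.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group of G = 4 sampled responses to the same prompt.
rewards = np.array([0.2, 0.9, 0.4, 0.7])
adv = group_relative_advantages(rewards)

# Positively rescaling the rewards leaves the normalized advantages
# (essentially) unchanged.
adv_scaled = group_relative_advantages(10.0 * rewards)
print(np.allclose(adv, adv_scaled, atol=1e-4))  # True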
Similar Papers
GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning
Machine Learning (CS)
Teaches AI to learn better from data.
Group Causal Policy Optimization for Post-Training Large Language Models
Machine Learning (CS)
Makes AI better at choosing the best answers.
Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification
Machine Learning (CS)
Makes AI smarter by improving its success rate.