AMIR-GRPO: Inducing Implicit Preference Signals into GRPO
By: Amir Hossein Yari, Fajri Koto
Potential Business Impact:
Helps AI learn to solve math problems better.
Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
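The abstract does not spell out the exact objective, so the sketch below shows one plausible reading of the idea: a GRPO-style group-relative loss plus a DPO-style pairwise term built from within-group reward rankings, with no extra annotations. The function name, the `beta` and `lambda_reg` hyperparameters, the simple policy-gradient surrogate (in place of GRPO's clipped-ratio objective), and the pairing rule are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def grpo_with_implicit_pairwise_loss(
    policy_logps: torch.Tensor,  # (G,) summed token log-probs per rollout, current policy
    ref_logps: torch.Tensor,     # (G,) same quantity under the reference policy
    rewards: torch.Tensor,       # (G,) scalar rewards for the G rollouts of one prompt
    beta: float = 0.1,           # temperature of the DPO-style term (assumed)
    lambda_reg: float = 0.5,     # weight of the pairwise regularizer (assumed)
) -> torch.Tensor:
    # GRPO part: group-relative advantages weight a policy-gradient surrogate.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grpo_loss = -(advantages.detach() * policy_logps).mean()

    # Implicit preference part: every pair (i, j) with reward_i > reward_j becomes
    # a DPO-style constraint favouring rollout i over rollout j, using log-prob
    # ratios against the reference policy as implicit rewards.
    ratios = policy_logps - ref_logps                        # (G,)
    diff = ratios.unsqueeze(1) - ratios.unsqueeze(0)         # (G, G) pairwise margins
    prefer = rewards.unsqueeze(1) > rewards.unsqueeze(0)     # (G, G) i preferred over j
    if prefer.any():
        pairwise_loss = -F.logsigmoid(beta * diff[prefer]).mean()
    else:
        pairwise_loss = torch.zeros((), device=rewards.device)

    return grpo_loss + lambda_reg * pairwise_loss
```

Under this reading, a group of G rollouts contributes up to G(G-1)/2 ordered pairs rather than G scalar advantages, which is one way to interpret the abstract's claim that each rollout group becomes a denser set of supervision constraints and that low-reward trajectories are suppressed more strongly.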
Similar Papers
On the Hidden Objective Biases of Group-based Reinforcement Learning
Machine Learning (CS)
Fixes AI learning to be more fair and accurate.
IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning
Machine Learning (CS)
Makes AI learn faster and better from feedback.
GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning
Machine Learning (CS)
Teaches AI to learn better from data.