Score: 1

AMIR-GRPO: Inducing Implicit Preference Signals into GRPO

Published: January 7, 2026 | arXiv ID: 2601.03661v1

By: Amir Hossein Yari, Fajri Koto

Potential Business Impact:

Helps AI models learn to solve math and other complex reasoning problems more reliably, without requiring extra preference annotations.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Reinforcement learning has become the primary paradigm for aligning large language models (LLMs) on complex reasoning tasks, with group relative policy optimization (GRPO) widely used in large-scale post-training. However, GRPO faces structural limitations in reasoning-heavy settings: sequence-level advantage normalization introduces systematic length bias, penalties for low-quality trajectories are diluted, and the scalar objective discards rich pairwise preference information embedded in within-group reward rankings. As a result, valuable supervision from costly rollouts remains underutilized. We propose AMIR-GRPO, which augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings, requiring no additional annotations. This mechanism amplifies suppression of low-reward trajectories, attenuates response-level length bias, and transforms each rollout group into a denser set of supervision constraints. Across multiple mathematical reasoning benchmarks, AMIR-GRPO consistently outperforms strong GRPO baselines, yields clearer separation between correct and incorrect reasoning chains, and delivers broader coverage gains beyond the subset of instances solved by standard GRPO.
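The abstract's core idea can be pictured with a small sketch. The snippet below is an illustrative assumption, not the paper's released code: it pairs a simplified group-normalized policy-gradient term (standing in for GRPO's clipped per-token objective) with a DPO-style pairwise regularizer built directly from within-group reward rankings, so that every rollout group yields many "preferred vs. rejected" constraints at no extra annotation cost. The function name `amir_grpo_style_loss`, the all-pairs pairing rule, and the coefficients `beta` and `lam` are hypothetical choices for illustration only.

```python
import torch
import torch.nn.functional as F

def amir_grpo_style_loss(rewards, logp_policy, logp_ref, beta=0.1, lam=0.1):
    """Illustrative sketch only; not the paper's exact objective.

    rewards     : (G,) scalar rewards for the G rollouts of one prompt
    logp_policy : (G,) summed log-probs of each rollout under the current policy
    logp_ref    : (G,) summed log-probs of each rollout under a frozen reference
    """
    # GRPO-style term: group-normalized advantages weight the policy log-probs
    # (a sequence-level REINFORCE surrogate in place of GRPO's clipped ratio).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grpo_loss = -(adv.detach() * logp_policy).mean()

    # Implicit DPO-style regularizer from intra-group reward rankings:
    # every pair (i, j) with r_i > r_j acts as a preferred/rejected pair,
    # turning one rollout group into O(G^2) contrastive constraints.
    logratio = logp_policy - logp_ref                       # (G,)
    margin = beta * (logratio.unsqueeze(1) - logratio.unsqueeze(0))  # (G, G)
    prefer = (rewards.unsqueeze(1) > rewards.unsqueeze(0)).float()
    n_pairs = prefer.sum().clamp(min=1.0)
    dpo_loss = -(F.logsigmoid(margin) * prefer).sum() / n_pairs

    return grpo_loss + lam * dpo_loss

# Toy usage: 4 rollouts of one prompt, two correct (reward 1) and two incorrect.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
logp_policy = torch.tensor([-12.3, -15.1, -11.8, -14.0], requires_grad=True)
logp_ref = torch.tensor([-12.0, -14.5, -11.9, -13.8])
loss = amir_grpo_style_loss(rewards, logp_policy, logp_ref)
loss.backward()
```

Under these assumptions, the pairwise term pushes down low-reward trajectories relative to high-reward ones even when the group-normalized advantage alone would dilute that penalty, which is the intuition the abstract describes.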

Country of Origin
🇦🇪 United Arab Emirates

Repos / Data Links

Page Count
16 pages

Category
Computer Science:
Machine Learning (CS)