Score: 2

IRPO: Scaling the Bradley-Terry Model via Reinforcement Learning

Published: January 2, 2026 | arXiv ID: 2601.00677v1

By: Haonan Song, Qingchen Xie, Huan Zhu, and more

BigTech Affiliations: Alibaba

Potential Business Impact:

Speeds up how AI models learn from feedback by scoring each candidate answer individually instead of comparing every pair, cutting training cost.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Generative Reward Models (GRMs) have attracted considerable research interest in reward modeling due to their interpretability, inference-time scalability, and potential for refinement through reinforcement learning (RL). However, widely used pairwise GRMs create a computational bottleneck when integrated with RL algorithms such as Group Relative Policy Optimization (GRPO). This bottleneck arises from two factors: (i) the O(n^2) time complexity of pairwise comparisons required to obtain relative scores, and (ii) the computational overhead of repeated sampling or additional chain-of-thought (CoT) reasoning to improve performance. To address the first factor, we propose Intergroup Relative Preference Optimization (IRPO), a novel RL framework that incorporates the well-established Bradley-Terry model into GRPO. By generating a pointwise score for each response, IRPO enables efficient evaluation of arbitrarily many candidates during RL training while preserving interpretability and fine-grained reward signals. Experimental results demonstrate that IRPO achieves state-of-the-art (SOTA) performance among pointwise GRMs across multiple benchmarks, with performance comparable to that of current leading pairwise GRMs. Furthermore, we show that IRPO significantly outperforms pairwise GRMs in post-training evaluations.
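
To make the complexity argument concrete, below is a minimal Python sketch of the two scoring routes the abstract contrasts. It assumes only the standard Bradley-Terry form P(i ≻ j) = σ(s_i − s_j) and a GRPO-style within-group normalization; the function names and example scores are illustrative and not taken from the paper.

```python
# Illustrative sketch (not the paper's code): pointwise Bradley-Terry scores
# versus pairwise comparison when evaluating one GRPO group of n responses.
import math

def bt_preference(score_i: float, score_j: float) -> float:
    """Standard Bradley-Terry preference probability: P(i > j) = sigma(s_i - s_j)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def pairwise_relative_scores(scores):
    """Pairwise route: every ordered pair is compared, so a pairwise GRM would
    need O(n^2) reward-model calls to rank a group of n responses."""
    n = len(scores)
    wins = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                wins[i] += bt_preference(scores[i], scores[j])
    return [w / (n - 1) for w in wins]

def group_relative_advantages(scores):
    """Pointwise route: one score per response (O(n) reward-model calls),
    normalized within the group as in GRPO-style advantage estimation."""
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    std = std if std > 0 else 1.0
    return [(s - mean) / std for s in scores]

# Hypothetical pointwise scores for one sampled group of candidate responses.
group_scores = [1.3, -0.2, 0.8, 2.1]
print(pairwise_relative_scores(group_scores))   # quadratic in group size
print(group_relative_advantages(group_scores))  # linear in group size
```

The sketch only illustrates why a single pointwise score per response removes the quadratic comparison cost; how IRPO trains the scorer and integrates it with GRPO is described in the paper itself.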

Country of Origin
🇨🇳 China

Page Count
14 pages

Category
Computer Science:
Machine Learning (CS)