GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning
By: Chenglong Wang, Yongyu Mu, Hang Zhou, and more
Potential Business Impact:
Teaches AI to explain why it likes answers.
Significant progress in reward modeling over recent years has been driven by a paradigm shift from task-specific designs toward generalist reward models. Despite this trend, a fundamental challenge remains in developing effective reward models: their heavy reliance on large-scale labeled preference data. Pre-training on abundant unlabeled data offers a promising direction, but existing approaches fall short of instilling explicit reasoning into reward models. To bridge this gap, we propose a self-training approach that leverages unlabeled data to elicit reward reasoning in reward models. Based on this approach, we develop GRAM-R$^2$, a generative reward model trained to produce not only preference labels but also accompanying reward rationales. GRAM-R$^2$ can serve as a foundation model for reward reasoning, supporting downstream applications such as response ranking and task-specific reward tuning across a wide range of tasks with minimal or no additional fine-tuning. Experiments on response ranking, task adaptation, and reinforcement learning from human feedback demonstrate that GRAM-R$^2$ consistently delivers strong performance, outperforming several strong discriminative and generative baselines.
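To make the response-ranking use case concrete, the sketch below shows one way a generative reward model of this kind could be wired into a pairwise ranking loop: the model is prompted with an instruction and two candidate responses, generates a rationale, and ends with a preference label that is then parsed. This is a hypothetical illustration, not the authors' code; the prompt template, the "Preference:" label format, and the `generate` callable are all assumptions standing in for whatever interface GRAM-R$^2$ actually exposes.

```python
# Minimal sketch (assumed interface, not the GRAM-R^2 release): rank two
# candidate responses with a generative reward model that emits a rationale
# followed by a preference label.

from typing import Callable

# Hypothetical prompt template; the real model's training format may differ.
PROMPT_TEMPLATE = (
    "Instruction:\n{instruction}\n\n"
    "Response A:\n{response_a}\n\n"
    "Response B:\n{response_b}\n\n"
    "Explain your reasoning, then end with 'Preference: A' or 'Preference: B'."
)

def rank_pair(
    generate: Callable[[str], str],  # wraps the reward model's text generation
    instruction: str,
    response_a: str,
    response_b: str,
) -> tuple[str, str]:
    """Return (preferred_label, rationale) for a pair of candidate responses."""
    prompt = PROMPT_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
    output = generate(prompt)
    # The model produces a free-form rationale and then a final verdict line;
    # split on the last "Preference:" marker to separate the two.
    rationale, _, verdict = output.rpartition("Preference:")
    label = "A" if verdict.strip().upper().startswith("A") else "B"
    return label, rationale.strip()

if __name__ == "__main__":
    # Toy stand-in for the reward model, just to show the control flow.
    def fake_generate(prompt: str) -> str:
        return ("Response B answers the question directly and gives the "
                "correct physical explanation. Preference: B")

    label, why = rank_pair(
        fake_generate,
        "Explain why objects fall.",
        "Because they want to.",
        "Gravity: mass curves spacetime, so objects follow geodesics downward.",
    )
    print(label, "-", why)
```

In practice, `generate` would wrap the actual GRAM-R$^2$ checkpoint, and the same rationale-plus-label output could feed RLHF by converting the preference into a scalar reward signal; the parsing shown here is only one plausible convention.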
Similar Papers
GRAM: A Generative Foundation Reward Model for Reward Generalization
Computation and Language
Teaches AI to learn better from more data.
MedGR$^2$: Breaking the Data Barrier for Medical Reasoning via Generative Reward Learning
Machine Learning (CS)
Makes AI learn medicine from generated data.