GRAM: A Generative Foundation Reward Model for Reward Generalization
By: Chenglong Wang, Yang Gan, Yifu Huo, and more
Potential Business Impact:
Teaches AI to judge answers better by learning from both labeled and unlabeled data, so it works across more tasks with little extra fine-tuning.
In aligning large language models (LLMs), reward models play an important role, but they are typically trained as discriminative models and rely only on labeled human preference data. In this paper, we explore methods that train reward models using both unlabeled and labeled data. Building on generative LLMs, we develop a generative reward model that is first trained via large-scale unsupervised learning and then fine-tuned via supervised learning. We also show that, by using label smoothing, we are in fact optimizing a regularized pairwise ranking loss. This result, in turn, provides a new view of training reward models, which links generative and discriminative models under the same class of training objectives. The outcome of these techniques is a foundation reward model that can be applied to a wide range of tasks with little or no further fine-tuning effort. Extensive experiments show that this model generalizes well across several tasks, including response ranking, reinforcement learning from human feedback, and task adaptation with fine-tuning, achieving significant performance improvements over several strong baseline models.
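To make the label-smoothing claim concrete, here is an illustrative sketch; the notation is assumed for this summary and is not taken from the paper. Let $\Delta = r(x, y_w) - r(x, y_l)$ be the reward margin between a preferred and a rejected response, $\sigma$ the sigmoid function, and $\epsilon$ the smoothing factor. A label-smoothed pairwise objective can be written as

$$\mathcal{L}_{\epsilon} = -(1-\epsilon)\log\sigma(\Delta) - \epsilon\log\sigma(-\Delta) = -\log\sigma(\Delta) + \epsilon\,\Delta,$$

using the identity $\log\sigma(\Delta) - \log\sigma(-\Delta) = \Delta$. The first term is the standard pairwise ranking loss, and the extra $\epsilon\,\Delta$ term acts as a regularizer that discourages overconfident reward margins, which is one way to read the paper's statement that label smoothing amounts to optimizing a regularized pairwise ranking loss.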
Similar Papers
GRAM-R$^2$: Self-Training Generative Foundation Reward Models for Reward Reasoning
Computation and Language
Teaches AI to explain why it likes answers.
Sentence-level Reward Model can Generalize Better for Aligning LLM from Human Preference
Computation and Language
Helps AI better understand human preferences at the sentence level.