URPO: A Unified Reward & Policy Optimization Framework for Large Language Models
By: Songshuo Lu, Hua Wang, Zhi Chen, and more
Potential Business Impact:
Makes AI smarter by teaching one model to answer and judge at the same time.
Large-scale alignment pipelines typically pair a policy model with a separately trained reward model whose parameters remain frozen during reinforcement learning (RL). This separation creates a complex, resource-intensive pipeline and suffers from a performance ceiling due to a static reward signal. We propose a novel framework, Unified Reward & Policy Optimization (URPO), that unifies instruction-following ("player") and reward modeling ("referee") within a single model and a single training phase. Our method recasts all alignment data, including preference pairs, verifiable reasoning, and open-ended instructions, into a unified generative format optimized by a single Group-Relative Policy Optimization (GRPO) loop. This enables the model to learn from ground-truth preferences and verifiable logic while simultaneously generating its own rewards for open-ended tasks. Experiments on the Qwen2.5-7B model demonstrate URPO's superiority. Our unified model significantly outperforms a strong baseline using a separate generative reward model, boosting the instruction-following score on AlpacaEval from 42.24 to 44.84 and the composite reasoning average from 32.66 to 35.66. Furthermore, URPO cultivates a superior internal evaluator as a byproduct of training, achieving a RewardBench score of 85.15 and surpassing the dedicated reward model it replaces (83.55). By eliminating the need for a separate reward model and fostering a co-evolutionary dynamic between generation and evaluation, URPO presents a simpler, more efficient, and more effective path towards robustly aligned language models.
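The abstract describes routing different data types to different reward sources inside one GRPO loop: fixed rewards from preference pairs, rule-based rewards for verifiable reasoning, and self-generated scores for open-ended prompts. The sketch below illustrates that routing and the group-relative advantage computation in plain Python; the function names (`score_group`, `self_score`) and the stub scoring logic are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of URPO-style reward routing inside a GRPO loop.
# Assumptions: data-type labels, function names, and the stub self-scorer
# are placeholders for illustration only.

import math
import random


def verifiable_reward(answer: str, ground_truth: str) -> float:
    """Reasoning data with known answers: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def self_score(model, prompt: str, answer: str) -> float:
    """Open-ended data: the same policy model acts as 'referee'.
    Stubbed here with a random score in [0, 1]."""
    return random.random()


def score_group(model, sample: dict, completions: list[str]) -> list[float]:
    """Route each sampled completion to a reward source by data type."""
    if sample["type"] == "verifiable":       # math / logic with ground truth
        return [verifiable_reward(c, sample["answer"]) for c in completions]
    if sample["type"] == "preference":       # chosen/rejected pairs
        return [1.0 if c == sample["chosen"] else 0.0 for c in completions]
    return [self_score(model, sample["prompt"], c) for c in completions]


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize rewards within one prompt's group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) + 1e-8
    return [(r - mean) / std for r in rewards]


# Usage: one verifiable prompt, a group of sampled completions, advantages.
sample = {"type": "verifiable", "prompt": "2+2=?", "answer": "4"}
completions = ["4", "5", "4", "22"]
print(group_relative_advantages(score_group(None, sample, completions)))
```

Because the referee is the policy model itself, its scoring of open-ended completions improves as the policy improves, which is the co-evolutionary dynamic the authors credit for the RewardBench gains.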
Similar Papers
Opinion: Towards Unified Expressive Policy Optimization for Robust Robot Learning
Robotics
Teaches robots new skills safely and quickly.
Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment
Machine Learning (CS)
Makes AI draw better pictures by fixing mistakes.
Teaching LLM to be Persuasive: Reward-Enhanced Policy Optimization for Alignment from Heterogeneous Rewards
Computation and Language
AI helps hotels and travelers agree on prices.