Spurious Rewards: Rethinking Training Signals in RLVR
By: Rulin Shao, Shuyue Stella Li, Rui Xin, and more
Potential Business Impact:
Teaches AI to do math better, even with wrong answers.
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 accuracy for Qwen2.5-Math-7B by 21.4 absolute points (random reward), 13.8 (format reward), 24.1 (incorrect label), 26.0 (1-shot RL), and 27.1 (majority voting), nearly matching the 29.1 points gained with ground-truth rewards. However, the spurious rewards that work for Qwen often fail to yield gains with other model families such as Llama3 or OLMo2. In particular, we find code reasoning (thinking in code without actual code execution) to be a distinctive Qwen2.5-Math behavior that becomes significantly more frequent after RLVR, rising from 65% to over 90% of responses, even with spurious rewards. Overall, we hypothesize that, given the lack of a useful reward signal, RLVR must somehow be surfacing useful reasoning representations learned during pretraining, although the exact mechanism remains a topic for future work. We suggest that future RLVR research be validated on diverse models rather than a single de facto choice, as we show that it is easy to get significant performance gains on Qwen models even with completely spurious reward signals.
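To make the reward variants concrete, the sketch below illustrates how such spurious reward functions might look. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the \boxed{} format check, and the majority-vote pseudo-labeling scheme are hypothetical choices for exposition.

```python
import random
import re
from collections import Counter

# Hypothetical sketches of the spurious reward signals described in the abstract.
# None of these look at the ground-truth answer in a useful way.

def random_reward(response: str) -> float:
    """Reward that ignores the response entirely (coin flip)."""
    return float(random.random() < 0.5)

def format_reward(response: str) -> float:
    """Reward any response that produces a \\boxed{...} answer, correct or not."""
    return float(re.search(r"\\boxed\{.+?\}", response) is not None)

def incorrect_label_reward(response_answer: str, wrong_label: str) -> float:
    """Reward agreement with a deliberately incorrect label."""
    return float(response_answer == wrong_label)

def majority_vote_reward(response_answer: str, sampled_answers: list[str]) -> float:
    """Reward agreement with the majority answer among the model's own samples,
    used as a pseudo-label in place of ground truth."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    return float(response_answer == pseudo_label)
```

Any of these functions could be dropped into a standard RLVR loop in place of a ground-truth verifier; the paper's finding is that, for Qwen2.5-Math, even such uninformative signals recover most of the gains.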
Similar Papers
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Artificial Intelligence
Makes computers learn new tricks, but not really.
Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Machine Learning (CS)
Teaches computers math with one example.
The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
Artificial Intelligence
Fixes AI reasoning errors by focusing on hard problems.