The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
By: Phuc Minh Nguyen, Chinh D. La, Duy M. H. Nguyen, and more
Potential Business Impact:
Fixes AI reasoning errors by focusing on hard problems.
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key method for improving Large Language Models' reasoning capabilities, yet recent evidence suggests it may paradoxically shrink the reasoning boundary rather than expand it. This paper investigates this shrinkage by analyzing RLVR's learning dynamics and reveals two critical phenomena that explain the failure. First, we expose negative interference in RLVR: learning to solve certain training problems actively reduces the likelihood of correct solutions for others, leading to a decline in Pass@$k$ performance, the probability of generating a correct solution within $k$ attempts. Second, we uncover a winner-take-all phenomenon: RLVR disproportionately reinforces problems whose correct solutions already have high likelihood under the base model, while suppressing problems whose correct solutions initially have low likelihood. Through extensive theoretical and empirical analysis on multiple mathematical reasoning benchmarks, we show that this effect arises from the inherent on-policy sampling in standard RL objectives, causing the model to converge toward narrow solution strategies. Based on these insights, we propose a simple yet effective data curation algorithm that focuses RLVR learning on low-likelihood problems, achieving notable improvements in Pass@$k$ performance. Our code is available at https://github.com/mail-research/SELF-llm-interference.
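The abstract's two key ingredients can be made concrete with a short sketch: the standard unbiased Pass@$k$ estimator, and a simple filter that keeps only problems the base model rarely solves. This is a minimal illustration, not the paper's actual SELF implementation (see the repository linked above); the sample count, threshold, and function names are assumptions chosen for readability.

```python
# Illustrative sketch only (not the paper's implementation).
# Assumptions: we have n base-model samples per problem and a count c of
# correct ones; n_samples=16 and threshold=0.25 are arbitrary choices.
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k solutions
    drawn from n samples (c of which are correct) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def curate_low_likelihood(problems, correct_counts, n_samples=16, threshold=0.25):
    """Keep problems whose empirical base-model solve rate is below threshold,
    i.e. the low-likelihood problems that RLVR would otherwise neglect."""
    kept = []
    for problem, c in zip(problems, correct_counts):
        if c / n_samples < threshold:
            kept.append(problem)
    return kept
```

Under this sketch, RLVR training would then be run only on the curated subset, steering gradient updates toward problems the base model solves rarely rather than reinforcing already high-likelihood solutions.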
Similar Papers
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
Artificial Intelligence
Makes AI think more logically, not just guess.
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Artificial Intelligence
Asks whether RL teaches AI anything genuinely new.
Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
Machine Learning (CS)
Helps AI learn math better by keeping rare ideas alive.