Outcome-based Exploration for LLM Reasoning
By: Yuda Song, Julia Kempe, Remi Munos
Potential Business Impact:
Makes AI smarter and more creative.
Reinforcement learning (RL) has emerged as a powerful method for improving the reasoning abilities of large language models (LLMs). Outcome-based RL, which rewards policies solely for the correctness of the final answer, yields substantial accuracy gains but also induces a systematic loss in generation diversity. This collapse undermines real-world performance, where diversity is critical for test-time scaling. We analyze this phenomenon by viewing RL post-training as a sampling process and show that, strikingly, RL can reduce effective diversity even on the training set relative to the base model. Our study highlights two central findings: (i) a transfer of diversity degradation, where reduced diversity on solved problems propagates to unsolved ones, and (ii) the tractability of the outcome space, since reasoning tasks admit only a limited set of distinct answers. Motivated by these insights, we propose outcome-based exploration, which assigns exploration bonuses according to final outcomes. We introduce two complementary algorithms: historical exploration, which encourages rarely observed answers via UCB-style bonuses, and batch exploration, which penalizes within-batch repetition to promote test-time diversity. Experiments on standard competition math with Llama and Qwen models demonstrate that both methods improve accuracy while mitigating diversity collapse. On the theoretical side, we formalize the benefit of outcome-based exploration through a new model of outcome-based bandits. Together, these contributions chart a practical path toward RL methods that enhance reasoning without sacrificing the diversity essential for scalable deployment.
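The two exploration schemes described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the bonus scale `c`, the penalty weight `lam`, and the helper names (`historical_bonus`, `batch_penalty`, `shaped_rewards`) are assumptions for the example; the paper only specifies that historical exploration grants UCB-style bonuses to rarely observed final answers and batch exploration penalizes repeated answers within a sampled batch.

```python
import math
from collections import Counter

def historical_bonus(answer, history_counts, c=1.0):
    """Historical exploration (sketch): a UCB-style bonus that is
    larger for final answers that have been sampled rarely so far."""
    n = history_counts.get(answer, 0)
    return c / math.sqrt(n + 1)

def batch_penalty(answers, lam=0.5):
    """Batch exploration (sketch): each duplicate of a final answer
    within the same sampled batch incurs a penalty."""
    counts = Counter(answers)
    return [-lam * (counts[a] - 1) for a in answers]

def shaped_rewards(answers, correct, history_counts, c=1.0, lam=0.5):
    """Combine the outcome reward (1 if the final answer is correct,
    else 0) with both exploration terms, updating the history counts."""
    penalties = batch_penalty(answers, lam)
    rewards = []
    for a, ok, p in zip(answers, correct, penalties):
        r = float(ok) + historical_bonus(a, history_counts, c) + p
        rewards.append(r)
        history_counts[a] = history_counts.get(a, 0) + 1
    return rewards

# Example: a batch of three sampled answers, two of them identical.
history = {}
rewards = shaped_rewards(["42", "42", "7"], [True, True, False], history)
```

In this toy batch, the duplicated answer "42" is penalized on both copies and its historical bonus shrinks after the first occurrence, while the novel (though incorrect) answer "7" receives the full bonus; this is the mechanism that counteracts collapse onto a single high-probability answer.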
Similar Papers
Representation-Based Exploration for Language Models: From Test-Time to Post-Training
Machine Learning (CS)
Teaches AI to find new, useful skills.
Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning
Machine Learning (CS)
Helps computers solve hard problems by remembering rare ideas.
Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
Artificial Intelligence
Teaches AI to think better by guiding key choices.