Exploration v.s. Exploitation: Rethinking RLVR through Clipping, Entropy, and Spurious Reward
By: Peter Chen, Xiaopeng Li, Ziniu Li, and more
Potential Business Impact:
Makes AI better at math by tricking it.
This paper examines the exploration-exploitation trade-off in reinforcement learning with verifiable rewards (RLVR), a framework for improving the reasoning of Large Language Models (LLMs). Recent studies suggest that RLVR can elicit strong mathematical reasoning in LLMs through two seemingly paradoxical mechanisms: spurious rewards, which suppress exploitation by rewarding outcomes unrelated to the ground truth, and entropy minimization, which suppresses exploration by pushing the model toward more confident and deterministic outputs. This highlights a puzzling dynamic: discouraging exploitation and discouraging exploration both improve reasoning performance, yet the underlying principles that reconcile these effects remain poorly understood. We focus on two fundamental questions: (i) how policy entropy relates to performance, and (ii) whether spurious rewards yield gains, potentially through the interplay of clipping bias and model contamination. Our results show that clipping bias under spurious rewards reduces policy entropy, leading to more confident and deterministic outputs, while entropy minimization alone is insufficient for improvement. We further propose a reward-misalignment model explaining why spurious rewards can enhance performance beyond contaminated settings. Our findings clarify the mechanisms behind spurious-reward benefits and provide principles for more effective RLVR training.
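For readers unfamiliar with the two quantities the abstract leans on, the following is a minimal sketch, not the paper's implementation. It assumes the standard PPO-style clipped surrogate and Shannon entropy over a next-token distribution; the function names `policy_entropy` and `clipped_surrogate`, the clip range `eps=0.2`, and the toy distributions are all illustrative assumptions.

```python
import math


def policy_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for a single token (assumed form).

    ratio is pi_new(token) / pi_old(token); advantage is the reward signal.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A spread-out policy has high entropy; a confident one has low entropy.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
entropy_uniform = policy_entropy(uniform)
entropy_peaked = policy_entropy(peaked)

# Clipping is asymmetric in the sign of the advantage:
# with a positive advantage, the objective is capped once ratio > 1 + eps;
# with a negative advantage at the same ratio, it is left unclipped.
capped = clipped_surrogate(1.5, 1.0)
uncapped = clipped_surrogate(1.5, -1.0)

print(entropy_uniform, entropy_peaked, capped, uncapped)
```

The asymmetry in the last two calls is one simple way to see how clipping can treat upward and downward probability updates differently even when the reward carries no information. How that translates into the entropy reduction reported in the abstract is the paper's analysis and is not reproduced here.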
Similar Papers
Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs
Computation and Language
Helps computers think deeper and solve harder problems.
Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning
Computation and Language
Teaches AI to learn better by watching its mistakes.
Spurious Rewards: Rethinking Training Signals in RLVR
Artificial Intelligence
Teaches AI to do math better, even with wrong answers.