Spurious Rewards Paradox: Mechanistically Understanding How RLVR Activates Memorization Shortcuts in LLMs
By: Lecheng Yan, Ruizhe Li, Guanhua Chen, and more
Potential Business Impact:
Explains how AI cheats by remembering answers, and how to detect and curb it.
Reinforcement Learning with Verifiable Rewards (RLVR) is highly effective for enhancing LLM reasoning, yet recent evidence shows models like Qwen 2.5 achieve significant gains even with spurious or incorrect rewards. We investigate this phenomenon and identify a "Perplexity Paradox": spurious RLVR triggers a divergence where answer-token perplexity drops while prompt-side coherence degrades, suggesting the model is bypassing reasoning in favor of memorization. Using Path Patching, Logit Lens, JSD analysis, and Neural Differential Equations, we uncover a hidden Anchor-Adapter circuit that facilitates this shortcut. We localize a Functional Anchor in the middle layers (L18-20) that triggers the retrieval of memorized solutions, followed by Structural Adapters in later layers (L21+) that transform representations to accommodate the shortcut signal. Finally, we demonstrate that scaling specific MLP keys within this circuit allows for bidirectional causal steering: artificially amplifying or suppressing contamination-driven performance. Our results provide a mechanistic roadmap for identifying and mitigating data contamination in RLVR-tuned models. Code is available at https://github.com/idwts/How-RLVR-Activates-Memorization-Shortcuts.
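To make the "Perplexity Paradox" diagnostic concrete, here is a minimal sketch of how one might measure answer-token perplexity and prompt-side perplexity separately for a causal LM. The checkpoint name, the example prompt/answer pair, and the simple concatenation-based boundary split are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: split perplexity into prompt-side and answer-side components.
# Assumptions (not from the paper's code): model checkpoint, prompt/answer
# example, and that tokenizing prompt+answer preserves the prompt boundary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # assumed checkpoint; swap in the RLVR-tuned model

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def split_perplexity(prompt: str, answer: str) -> dict:
    """Return perplexity over prompt tokens and answer tokens separately."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift so position t predicts token t+1, then take per-token NLL.
    log_probs = torch.log_softmax(logits[:, :-1].float(), dim=-1)
    targets = full_ids[:, 1:]
    token_nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Predictions before this index fall inside the prompt (approximate if
    # the tokenizer merges tokens across the prompt/answer boundary).
    n_prompt = prompt_ids.shape[1] - 1
    return {
        "prompt_ppl": token_nll[:, :n_prompt].mean().exp().item(),
        "answer_ppl": token_nll[:, n_prompt:].mean().exp().item(),
    }

# A widening gap after spurious RLVR (answer_ppl falling while prompt_ppl
# rises) is the divergence the abstract calls the Perplexity Paradox.
print(split_perplexity("Q: What is 17 * 24? Let's think step by step. ",
                       "17 * 24 = 408. The answer is 408."))
```

Comparing these two numbers before and after spurious-reward RLVR, on both contaminated and held-out problems, is one way to reproduce the divergence the abstract describes; the paper's repository linked above contains the authors' actual analysis code.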
Similar Papers
The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
Artificial Intelligence
Fixes AI reasoning errors by focusing on hard problems.
Spurious Rewards: Rethinking Training Signals in RLVR
Artificial Intelligence
Teaches AI to do math better, even with wrong answers.
Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards
Computation and Language
Teaches computers to solve math problems better.