Reward Hacking Mitigation using Verifiable Composite Rewards
By: Mirza Farhan Bin Tarek, Rahmatollah Beheshti
Potential Business Impact:
Helps medical question-answering AI give correct, well-reasoned answers by curbing reward hacking.
Reinforcement Learning from Verifiable Rewards (RLVR) has recently shown that large language models (LLMs) can develop their own reasoning without direct supervision. However, applications in the medical domain, specifically question answering, are susceptible to significant reward hacking during the reasoning phase. Our work addresses two primary forms of this behavior: i) providing a final answer without preceding reasoning, and ii) employing non-standard reasoning formats to exploit the reward mechanism. To mitigate these, we introduce a composite reward function with specific penalties for these behaviors. Our experiments show that extending RLVR with our proposed reward model leads to better-formatted reasoning and less reward hacking while maintaining good accuracy relative to the baselines. This approach marks a step toward reducing reward hacking and enhancing the reliability of models trained with RLVR.
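To make the idea concrete, below is a minimal sketch of what such a composite reward could look like, assuming the model emits reasoning inside <think> tags followed by an <answer> tag. The tag names, penalty weights, and exact-match verifier are illustrative assumptions, not the authors' actual implementation.

```python
import re

# Illustrative markup: reasoning inside <think> tags, answer inside <answer> tags.
# The paper's actual format checks and penalty weights may differ; these are assumptions.
FORMAT_RE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def composite_reward(completion: str, gold_answer: str) -> float:
    """Sketch of a verifiable composite reward: a correctness term plus penalties
    for (i) answering without preceding reasoning and (ii) non-standard format."""
    reward = 0.0
    match = FORMAT_RE.search(completion)

    if match is None:
        # Non-standard reasoning format: output cannot be parsed at all.
        return reward - 1.0

    reasoning, answer = match.group(1).strip(), match.group(2).strip()

    if not reasoning:
        # Final answer emitted without any preceding reasoning.
        reward -= 0.5

    # Verifiable correctness term (exact string match as a stand-in verifier).
    if answer.lower() == gold_answer.lower():
        reward += 1.0

    return reward
```

In an RLVR loop, this scalar would score each sampled completion in place of a plain correctness check; the weights (-1.0, -0.5, +1.0) are placeholders standing in for whatever penalty values the paper tunes.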
Similar Papers
Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs
Artificial Intelligence
Makes AI think more logically, not just guess.
The Reasoning Boundary Paradox: How Reinforcement Learning Constrains Language Models
Artificial Intelligence
Fixes AI reasoning errors by focusing on hard problems.
AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning
Computation and Language
Teaches AI to think step-by-step, not just guess.