Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking
By: Yuchun Miao, Liang Ding, Sen Zhang, and more
Potential Business Impact:
Stops AI from cheating to get good answers.
Despite the success of Reinforcement Learning from Human Feedback (RLHF) in aligning language models with human values, reward hacking (also known as reward over-optimization) remains a major challenge. We identify two key obstacles to its mitigation: (1) reward misgeneralization in reward modeling, where reward models overfit to spurious, preference-irrelevant features; and (2) the lack of suitable regularization during RL optimization, as existing token-level constraints often over-restrict the policy space. To address these issues, we propose InfoRM, an information-theoretic reward modeling framework based on the Information Bottleneck (IB) principle, which filters out preference-irrelevant information to alleviate reward misgeneralization. We further observe that reward-hacked responses manifest as pronounced outliers in InfoRM's IB latent space, measured by Mahalanobis distance from the SFT-induced distribution. Motivated by this, we introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape while maintaining alignment. We prove that IBL is theoretically equivalent to the pessimistic RL objective within the IB latent space. Finally, we present Mahalanobis Outlier Probability (MOP), a statistical metric for quantifying reward hacking severity, enabling principled hyperparameter tuning and online mitigation such as early stopping. Extensive experiments across diverse LLMs and datasets confirm the generality of our findings, the effectiveness of InfoRM and IBL, and the reliability of MOP as a diagnostic tool, collectively advancing the state of RLHF.
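To make the outlier-detection idea concrete, here is a minimal Python sketch of Mahalanobis-distance scoring in a latent space, in the spirit of the MOP metric described above. It assumes the SFT-induced latents are roughly Gaussian and uses a chi-square tail probability as the outlier score; the function names, latent dimensions, and thresholds are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: score policy responses by how far their IB latents fall
# from the SFT-induced latent distribution (Mahalanobis distance).
import numpy as np
from scipy.stats import chi2

def fit_sft_latent_stats(sft_latents: np.ndarray):
    """Fit mean and inverse covariance of the SFT-induced latent distribution."""
    mu = sft_latents.mean(axis=0)
    cov = np.cov(sft_latents, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    return mu, cov_inv

def squared_mahalanobis(latents: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray):
    """Squared Mahalanobis distance of each latent from the SFT distribution."""
    diff = latents - mu
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

def outlier_probability(latents: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray):
    """Chi-square tail-based outlier score (assumption: Gaussian SFT latents,
    so squared distances are approximately chi2 with d degrees of freedom)."""
    d2 = squared_mahalanobis(latents, mu, cov_inv)
    return chi2.cdf(d2, df=latents.shape[1])  # closer to 1 => more outlying

# Usage with placeholder latents: compare RLHF policy samples to the SFT reference.
sft_z = np.random.randn(512, 16)           # stand-in for SFT-response IB latents
policy_z = np.random.randn(64, 16) + 2.0   # stand-in for policy-response latents
mu, cov_inv = fit_sft_latent_stats(sft_z)
scores = outlier_probability(policy_z, mu, cov_inv)
print("fraction flagged as likely reward-hacked:", float((scores > 0.99).mean()))
```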
Similar Papers
Revisiting LLM Reasoning via Information Bottleneck
Artificial Intelligence
Makes computers think better at math problems.
Learning from Failures: Understanding LLM Alignment through Failure-Aware Inverse RL
Machine Learning (CS)
Teaches AI to be safer by learning from mistakes.
Robust Reinforcement Learning from Human Feedback for Large Language Models Fine-Tuning
Machine Learning (Stat)
Makes AI understand what people want better.