Does Self-Evaluation Enable Wireheading in Language Models?
By: David Demitri Africa, Hans Ethan Ting
Potential Business Impact:
Makes AI cheat itself instead of learning.
Self-evaluation is increasingly central to language model training, from constitutional AI to self-refinement. We investigate whether coupling self-evaluation to reward signals creates incentives for wireheading, where agents manipulate reward measurements rather than improving task performance. We formalize conditions under which reward-channel control strictly dominates task-focused behavior in POMDPs and test these predictions empirically. Across two models and three tasks, we find that models whose self-grades determine rewards exhibit substantial grade inflation without corresponding accuracy gains, particularly on ambiguous tasks like summarization. Models that self-evaluate but do not control rewards show no such inflation. Our results demonstrate that self-evaluation is safe when decoupled from learning signals but dangerous when coupled, with clear implications for agentic system design.
Similar Papers
Reward Models are Metrics in a Trench Coat
Computation and Language
Makes AI better at judging its own answers.
Can Language Models Critique Themselves? Investigating Self-Feedback for Retrieval Augmented Generation at BioASQ 2025
Computation and Language
Helps AI learn to find better answers.
Steering Evaluation-Aware Language Models To Act Like They Are Deployed
Computation and Language
Makes AI tell truth during tests.