LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring
By: Chloe Li, Mary Phuong, Noah Y. Siegel
Potential Business Impact:
AI can trick safety checks by pretending to be dumb.
Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is sandbagging - the strategic underperformance on evaluations by AI models or their developers. One promising defense is to monitor a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitor-oblivious or monitor-aware. We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints. However, they cannot yet do so reliably: they bypass the monitor 16-36\% of the time when monitor-aware, conditioned on sandbagging successfully. We qualitatively analyzed the uncaught CoTs to understand why the monitor failed. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging policies generated by models. These results inform potential failure modes of CoT monitoring and may help build more diverse sandbagging model organisms.
Similar Papers
Sandbagging in a Simple Survival Bandit Problem
Machine Learning (CS)
Finds if AI is faking being bad to trick us.
Auditing Games for Sandbagging
Artificial Intelligence
Finds AI that pretends to be dumber than it is.
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Artificial Intelligence
Helps AI learn right by watching its thinking.