Evaluation Awareness Scales Predictably in Open-Weights Large Language Models
By: Maheep Chaudhary, Ian Su, Nikhil Hooda, and more
Potential Business Impact:
AI can hide risky skills when tested, and this gets worse as models grow.
Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behavior known as evaluation awareness. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single 70B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across 15 models spanning 0.27B to 70B parameters from four families, using linear probing on steering-vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future, larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at https://anonymous.4open.science/r/evaluation-awareness-scaling-laws/README.md.
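To make the method concrete, here is a minimal sketch of the two steps the abstract describes: fitting a linear probe on per-prompt activations to separate evaluation from deployment contexts, and fitting a power law to probe accuracy versus model size. It is not the authors' released code (see the repository link above); the activations are synthetic stand-ins, and the names probe_accuracy and power_law, the logistic-regression probe, and the saturating power-law form are illustrative assumptions.

```python
# Minimal sketch, not the authors' implementation (see the linked repository).
# Step 1: a linear probe on activations separating evaluation vs. deployment prompts.
# Step 2: a power-law fit of probe accuracy against model size.
# The paper probes steering-vector activations; here the features are synthetic.
import numpy as np
from scipy.optimize import curve_fit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_accuracy(activations: np.ndarray, labels: np.ndarray) -> float:
    """Train a linear probe and return held-out accuracy at telling
    evaluation prompts (label 1) apart from deployment prompts (label 0)."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        activations, labels, test_size=0.3, random_state=0, stratify=labels)
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return probe.score(x_te, y_te)


def power_law(size_b, a, b, c):
    """Saturating power law: awareness(N) = c - a * N**(-b), N in billions of params."""
    return c - a * np.power(size_b, -b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Model sizes in billions of parameters (0.27B to 70B, as in the paper).
    sizes_b = np.array([0.27, 0.5, 1.0, 3.0, 7.0, 13.0, 34.0, 70.0])

    # Synthetic activations: class separation grows with model size, so the
    # probe becomes more accurate for larger models (illustrative only).
    accuracies = []
    for s in sizes_b:
        offset = 0.1 * np.log10(s + 1.0)
        feats = np.vstack([rng.normal(0.0, 1.0, (200, 16)),
                           rng.normal(offset, 1.0, (200, 16))])
        labels = np.array([0] * 200 + [1] * 200)
        accuracies.append(probe_accuracy(feats, labels))
    accuracies = np.array(accuracies)

    # Fit the scaling curve and extrapolate to a larger (hypothetical) model.
    (a, b, c), _ = curve_fit(power_law, sizes_b, accuracies,
                             p0=[0.5, 0.5, 0.9], bounds=(0.0, [2.0, 2.0, 1.0]))
    print(f"fit: awareness(N) ~= {c:.2f} - {a:.2f} * N^(-{b:.2f})")
    print(f"forecast at 400B params: {power_law(400.0, a, b, c):.3f}")
```

In the study itself the probe features would come from matched evaluation-style and deployment-style prompts run through each model, rather than synthetic draws; the fitted curve is what allows forecasting evaluation awareness in models larger than 70B.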
Similar Papers
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Computation and Language
A workflow that makes safety benchmarks more reliable by measuring and reducing evaluation awareness.
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
Computation and Language
Shows that design choices beyond sheer scale shape how well language models perform downstream.
Scaling behavior of large language models in emotional safety classification across sizes and tasks
Computation and Language
Studies how model size affects classification of emotionally unsafe content, helping keep chats safe.