Safety by Measurement: A Systematic Literature Review of AI Safety Evaluation Methods
By: Markov Grey, Charbel-Raphaël Segerie
Potential Business Impact:
Tests AI systems for dangerous capabilities and hidden goals.
As frontier AI systems advance toward transformative capabilities, we need a parallel transformation in how we measure and evaluate these systems to ensure safety and inform governance. While benchmarks have been the primary method for estimating model capabilities, they often fail to establish true upper bounds or predict deployment behavior. This literature review consolidates the rapidly evolving field of AI safety evaluations, proposing a systematic taxonomy around three dimensions: what properties we measure, how we measure them, and how these measurements integrate into frameworks. We show how evaluations go beyond benchmarks by measuring what models can do when pushed to the limit (capabilities), the behavioral tendencies they exhibit by default (propensities), and whether our safety measures remain effective even when faced with subversive, adversarial AI (control). These properties are measured through behavioral techniques such as scaffolding, red teaming, and supervised fine-tuning, alongside internal techniques such as representation analysis and mechanistic interpretability. We provide deeper explanations of safety-critical capabilities such as cybersecurity exploitation, deception, autonomous replication, and situational awareness, alongside concerning propensities such as power-seeking and scheming. The review explores how these evaluation methods integrate into governance frameworks that translate results into concrete development decisions. We also highlight challenges to safety evaluations, including proving the absence of capabilities, potential model sandbagging, and incentives for "safetywashing", and identify promising research directions. By synthesizing scattered resources, this literature review aims to provide a central reference point for understanding AI safety evaluations.
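As a rough illustration only (not code from the paper), the review's three-dimensional taxonomy can be sketched as a data structure: each evaluation is tagged with the property it measures, the technique used to measure it, and the governance framework its result feeds into. All names below are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Property(Enum):
    """What is measured (the review's first dimension)."""
    CAPABILITY = "capability"    # what the model can do when pushed to the limit
    PROPENSITY = "propensity"    # behavioral tendencies exhibited by default
    CONTROL = "control"          # whether safety measures hold against a subversive model


class Technique(Enum):
    """How it is measured (the second dimension)."""
    SCAFFOLDING = "scaffolding"
    RED_TEAMING = "red teaming"
    SUPERVISED_FINE_TUNING = "supervised fine-tuning"
    REPRESENTATION_ANALYSIS = "representation analysis"
    MECHANISTIC_INTERPRETABILITY = "mechanistic interpretability"


@dataclass
class SafetyEvaluation:
    """One evaluation record, tagged along the review's three dimensions."""
    target: Property      # which property is being measured
    technique: Technique  # how it is measured
    framework: str        # governance framework the result informs (third dimension)


# Hypothetical example: red-teaming a cyber-exploitation capability,
# with the result feeding into a frontier safety framework decision.
record = SafetyEvaluation(
    target=Property.CAPABILITY,
    technique=Technique.RED_TEAMING,
    framework="frontier safety framework",
)
print(record)
```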
Similar Papers
Toward an Evaluation Science for Generative AI Systems
Artificial Intelligence
Tests AI systems to ensure they are safe and effective.
Measuring What Matters: Connecting AI Ethics Evaluations to System Attributes, Hazards, and Harms
Human-Computer Interaction
Helps evaluators connect AI systems to the hazards and harms they may cause.
Evaluating AI Companies' Frontier Safety Frameworks: Methodology and Results
Computers and Society
Helps AI companies build safer, more responsible systems.