Optimizing AI Agent Attacks With Synthetic Data
By: Chloe Loughridge, Paul Colognese, Avery Griffin, and more
Potential Business Impact:
Makes AI safer by finding its weaknesses.
As AI deployments become more complex and high-stakes, estimating their risk becomes increasingly important. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies, which is challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse, realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution, and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters in this simulation, and then show that the results transfer to SHADE-Arena. This yields a substantial improvement in attack strength, reducing the safety score from a baseline of 0.87 to 0.41 with our scaffold.
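The abstract does not spell out the probabilistic model, so the following is only a minimal toy sketch of the simulation-based tuning idea it describes: tune an attack hyperparameter against a cheap probabilistic simulator, then transfer the chosen value to the real environments. Every name, probability, and functional form here (simulate_attack, the subtlety parameter, the linear execution/evasion tradeoff, the grid values) is an illustrative assumption, not the paper's actual model.

```python
import random

# Toy probabilistic model of attack dynamics (assumed, not from the paper):
# an attack succeeds only if every step both executes correctly and evades
# the monitor. A single "subtlety" hyperparameter trades execution
# reliability against evasion probability.

def simulate_attack(num_steps: int, subtlety: float, trials: int = 10_000) -> float:
    """Estimate attack success rate under the toy dynamics model."""
    p_exec = 0.95 - 0.3 * subtlety    # subtler actions are harder to execute
    p_evade = 0.70 + 0.28 * subtlety  # but less likely to raise suspicion
    successes = 0
    for _ in range(trials):
        if all(random.random() < p_exec and random.random() < p_evade
               for _ in range(num_steps)):
            successes += 1
    return successes / trials

# Grid-search the hyperparameter in simulation; in the paper's workflow the
# optimized settings would then be evaluated in SHADE-Arena itself.
best_rate, best_subtlety = max(
    (simulate_attack(num_steps=5, subtlety=s), s)
    for s in [i / 10 for i in range(11)]
)
print(f"best simulated success rate {best_rate:.3f} at subtlety={best_subtlety}")
```

The point of the sketch is the workflow, not the numbers: because each simulated rollout is cheap, the hyperparameter search that would be compute-prohibitive in the full agentic environments can be run exhaustively in the surrogate model first.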
Similar Papers
Evaluating Control Protocols for Untrusted AI Agents
Artificial Intelligence
Safely keeps untrusted AI agents from doing bad things.
AI Agentic Vulnerability Injection And Transformation with Optimized Reasoning
Cryptography and Security
Creates realistic bugs for training security AI.
Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs
Cryptography and Security
AI is better at defending computers than attacking them.