Benchmarking Misuse Mitigation Against Covert Adversaries
By: Davis Brown, Mahdi Sabbaghi, Luze Sun, and more
Potential Business Impact:
Helps AI avoid helping bad guys do bad things.
Existing language model safety evaluations focus on overt attacks and low-stakes tasks. Realistic attackers can subvert current safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
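To make the stateful-defense idea concrete, below is a minimal sketch of one plausible design: track a user's recent queries and refuse once accumulated risk across the session crosses a threshold, even though no single query trips the filter on its own. This is an illustration of the general concept, not the paper's actual defense; `score_topic_risk` and its keyword list are hypothetical stand-ins for a trained risk classifier.

```python
# Minimal sketch of a stateful defense against decomposition attacks.
# Assumption: a per-query risk scorer exists; here it is a toy keyword
# heuristic (a real system would use a trained classifier).

from collections import defaultdict, deque

SENSITIVE_TERMS = {"synthesis", "precursor", "exploit", "payload"}  # illustrative only


def score_topic_risk(query: str) -> float:
    """Hypothetical scorer: fraction of sensitive terms present in the query."""
    words = set(query.lower().split())
    return len(words & SENSITIVE_TERMS) / len(SENSITIVE_TERMS)


class StatefulDefense:
    """Keeps a sliding window of per-user risk scores and refuses when the
    cumulative risk over the window crosses a threshold, catching attacks
    whose individual queries look benign."""

    def __init__(self, window: int = 20, threshold: float = 0.5):
        self.threshold = threshold
        self.history = defaultdict(lambda: deque(maxlen=window))

    def allow(self, user_id: str, query: str) -> bool:
        self.history[user_id].append(score_topic_risk(query))
        return sum(self.history[user_id]) < self.threshold


defense = StatefulDefense()
for q in ["how to mix chemicals safely", "common precursor handling", "synthesis steps"]:
    verdict = "allowed" if defense.allow("user-1", q) else "refused"
    print(f"{q!r} -> {verdict}")
```

In this toy run, the first two queries pass individually, but the third pushes the session's cumulative risk to the threshold and is refused; a stateless filter scoring each query in isolation would have allowed all three.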
Similar Papers
Replicating TEMPEST at Scale: Multi-Turn Adversarial Attacks Against Trillion-Parameter Frontier Models
Computation and Language
Makes AI safer from bad instructions.
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
Computation and Language
Tests AI to find computer security problems.
AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
Cryptography and Security
Tests if AI can break computer security defenses.