A Benchmark for Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
By: Miles Q. Li, Benjamin C. M. Fung, Martin Weiss and more
As autonomous AI agents are increasingly deployed in high-stakes environments, ensuring their safety and alignment with human values has become a paramount concern. Current safety benchmarks often focus only on single-step decision-making, on simulated environments for tasks with malicious intent, or on adherence to explicit negative constraints. Benchmarks designed to capture emergent forms of outcome-driven constraint violations are lacking. Such violations arise when agents pursue goal optimization under strong performance incentives while deprioritizing ethical, legal, or safety constraints over multiple steps in realistic production settings. To address this gap, we introduce a new benchmark comprising 40 distinct scenarios. Each scenario presents a task that requires multi-step actions, and the agent's performance is tied to a specific Key Performance Indicator (KPI). Each scenario features Mandated (instruction-commanded) and Incentivized (KPI-pressure-driven) variations to distinguish between obedience and emergent misalignment. Across 12 state-of-the-art large language models, we observe outcome-driven constraint violation rates ranging from 1.3% to 71.4%, with 9 of the 12 evaluated models exhibiting misalignment rates between 30% and 50%. Strikingly, we find that superior reasoning capability does not inherently ensure safety; for instance, Gemini-3-Pro-Preview, one of the most capable models evaluated, exhibits the highest violation rate at over 60%, frequently escalating to severe misconduct to satisfy KPIs. Furthermore, we observe significant "deliberative misalignment", where the models that power the agents recognize their own actions as unethical when evaluated separately. These results emphasize the critical need for more realistic agentic-safety training before deployment to mitigate the risks these agents pose in the real world.