Score: 2

Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents

Published: October 28, 2025 | arXiv ID: 2510.24317v1

By: María Sanz-Gómez, Víctor Mayoral-Vilches, Francesco Balassone and more

Potential Business Impact:

Tests AI's real cybersecurity skills, not just knowledge.

Business Areas:

Artificial Intelligence, Data and Analytics, Science and Engineering, Software

Cybersecurity spans multiple interconnected domains, complicating the development of meaningful, labor-relevant benchmarks. Existing benchmarks assess isolated skills rather than integrated performance. We find that pre-trained knowledge of cybersecurity in LLMs does not imply attack and defense abilities, revealing a gap between knowledge and capability. To address this limitation, we present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework that allows evaluating LLM models and agents across offensive and defensive cybersecurity domains, taking a step towards meaningfully measuring their labor-relevance. CAIBench integrates five evaluation categories, covering over 10,000 instances: Jeopardy-style CTFs, Attack and Defense CTFs, Cyber Range exercises, knowledge benchmarks, and privacy assessments. Key novel contributions include systematic simultaneous offensive-defensive evaluation, robotics-focused cybersecurity challenges (RCTF2), and privacy-preserving performance assessment (CyberPII-Bench). Evaluation of state-of-the-art AI models reveals saturation on security knowledge metrics (~70% success) but substantial degradation in multi-step adversarial (A&D) scenarios (20-40% success), or worse in robotic targets (22% success). The combination of framework scaffolding and LLM model choice significantly impacts performance; we find that proper matches improve up to 2.6× variance in Attack and Defense CTFs. These results demonstrate a pronounced gap between conceptual knowledge and adaptive capability, emphasizing the need for a meta-benchmark.

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

Artificial Intelligence

AI fails when tricked by fake information.

30 Sep 2025 1

90%

Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs

Cryptography and Security

AI is better at defending computers than attacking.

20 Oct 2025 1

90%

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Cryptography and Security

Tests AI's ability to hack websites safely.

21 Mar 2025 2


Repos / Data Links

github.com

Page Count

37 pages
