CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
By: Yuxuan Zhu, Antony Kellermann, Dylan Bowman, and more
Potential Business Impact:
Tests AI agents' ability to exploit real website vulnerabilities, safely contained in a sandbox.
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing a significant threat to existing applications. This growing risk highlights the urgent need for a real-world benchmark that evaluates the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short: they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities requires both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures (CVEs). In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications under conditions that mimic the real world, while also providing an effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can exploit up to 13% of the vulnerabilities.
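To make the setup concrete, here is a minimal sketch of how a sandboxed evaluation loop of this kind might look. It is an illustration under stated assumptions, not the paper's implementation: the target URL, the container layout, the `check_exploit` grader, and the `agent_run` interface are all hypothetical names introduced here.

```python
"""Minimal sketch of a CVE-Bench-style sandbox evaluation loop.

Illustrative only: the target URL, container layout, `check_exploit`
grader, and agent interface below are assumptions, not the paper's code.
"""
import subprocess

import requests

TARGET_URL = "http://localhost:9090"  # hypothetical sandboxed target


def start_target(compose_file: str) -> None:
    # Launch the vulnerable web application in an isolated container,
    # reproducing the CVE's affected version and configuration.
    subprocess.run(
        ["docker", "compose", "-f", compose_file, "up", "-d"], check=True
    )


def target_is_healthy() -> bool:
    # Only score attempts against a target that actually came up.
    try:
        return requests.get(TARGET_URL, timeout=5).status_code < 500
    except requests.RequestException:
        return False


def check_exploit() -> bool:
    # Hypothetical grader for one attack type (denial of service):
    # success is judged by an observable effect on the target rather
    # than by recovering a planted flag string.
    try:
        requests.get(TARGET_URL, timeout=5)
        return False  # application still responds: no DoS achieved
    except requests.RequestException:
        return True  # application is down: the attack succeeded


def evaluate(agent_run, compose_file: str) -> bool:
    start_target(compose_file)
    if not target_is_healthy():
        raise RuntimeError("target failed to start")
    agent_run(TARGET_URL)  # the LLM agent attacks only the sandbox
    return check_exploit()  # grade by the observed side effect
```

The design choice this sketch tries to capture is that success is measured by a concrete effect on the running application (here, whether it stays up), rather than by recovering a planted flag string, which is what distinguishes this setup from abstracted CTF-style benchmarks.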
Similar Papers
PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities
Cryptography and Security
Tests if AI can hack computers safely.
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Machine Learning (CS)
Tests AI for finding and fixing computer bugs.
Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents
Cryptography and Security
Tests AI's real cybersecurity skills, not just knowledge.