PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities
By: Zicheng Liu, Lige Huang, Jie Zhang, and more
Potential Business Impact:
Safely tests whether AI can hack computers.
The increasing autonomy of Large Language Models (LLMs) necessitates a rigorous evaluation of their potential to aid in cyber offense. Existing benchmarks often lack real-world complexity and are thus unable to accurately assess LLMs' cybersecurity capabilities. To address this gap, we introduce PACEbench, a practical AI cyber-exploitation benchmark built on the principles of realistic vulnerability difficulty, environmental complexity, and cyber defenses. Specifically, PACEbench comprises four scenarios spanning single, blended, chained, and defense vulnerability exploitations. To handle these complex challenges, we propose PACEagent, a novel agent that emulates human penetration testers by supporting multi-phase reconnaissance, analysis, and exploitation. Extensive experiments with seven frontier LLMs demonstrate that current models struggle with complex cyber scenarios, and none can bypass defenses. These findings suggest that current models do not yet pose a generalized cyber offense threat. Nonetheless, our work provides a robust benchmark to guide the trustworthy development of future models.
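The abstract describes PACEagent as emulating a human penetration tester through distinct reconnaissance, analysis, and exploitation phases. As a rough illustration only, the Python sketch below shows how such a phased agent loop might be organized; all names here (Phase, AgentState, query_llm, run_phase) are hypothetical assumptions for exposition, not the paper's actual implementation.

```python
"""Minimal sketch of a multi-phase penetration-testing agent loop,
in the spirit of the phased design described in the abstract.
Every identifier below is illustrative, not from the paper."""

from dataclasses import dataclass, field
from enum import Enum, auto


class Phase(Enum):
    RECONNAISSANCE = auto()  # enumerate the target
    ANALYSIS = auto()        # reason about discovered attack surface
    EXPLOITATION = auto()    # attempt exploitation of chosen vector


@dataclass
class AgentState:
    target: str
    findings: list[str] = field(default_factory=list)


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a frontier LLM; swap in a real client."""
    return f"[model response to: {prompt[:40]}...]"


def run_phase(phase: Phase, state: AgentState) -> None:
    """Ask the model for the next action in the current phase and record it."""
    prompt = (
        f"Phase: {phase.name}. Target: {state.target}. "
        f"Known findings: {state.findings}. Propose the next step."
    )
    state.findings.append(f"{phase.name}: {query_llm(prompt)}")


def run_agent(target: str) -> AgentState:
    """Drive the agent through the three phases in order, as a human
    penetration tester would: enumerate, analyze, then exploit."""
    state = AgentState(target=target)
    for phase in (Phase.RECONNAISSANCE, Phase.ANALYSIS, Phase.EXPLOITATION):
        run_phase(phase, state)
    return state


if __name__ == "__main__":
    for note in run_agent("http://testbed.local").findings:
        print(note)
```

An explicit phase enum like this keeps the model's context scoped to one stage at a time, mirroring how human testers separate enumeration from exploitation; whether PACEagent implements its phases this way is an assumption.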
Similar Papers
CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities
Cryptography and Security
Safely tests AI's ability to hack websites.
Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents
Cryptography and Security
Tests AI's real cybersecurity skills, not just knowledge.
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Machine Learning (CS)
Tests AI for finding and fixing computer bugs.