Score: 4

DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Published: May 31, 2025 | arXiv ID: 2506.00739v3

By: Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, and more

BigTech Affiliations: Microsoft

Potential Business Impact:

Benchmarks AI language agents on detecting and analyzing computer security problems, such as network intrusions and code vulnerabilities.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. DefenderBench is available at https://github.com/microsoft/DefenderBench.
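
To make the "modular design" claim more concrete, below is a minimal, purely illustrative sketch of how such a benchmark harness could let researchers plug in custom models and tasks. The class names, registry, and toy task here are assumptions for illustration only and are not DefenderBench's actual API; consult the repository for the real interface.

```python
# Hypothetical sketch only: names and interfaces are illustrative,
# NOT DefenderBench's actual API. It shows how a modular harness can
# accept any model (prompt -> completion) and any scored task.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskResult:
    task_name: str
    score: float  # fraction of instances solved, in [0, 1]


class BenchmarkHarness:
    """Registers tasks, then evaluates a model backend on each of them."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Callable[[Callable[[str], str]], float]] = {}

    def register_task(self, name: str,
                      runner: Callable[[Callable[[str], str]], float]) -> None:
        # A task is any callable that takes a model and returns a score.
        self._tasks[name] = runner

    def evaluate(self, model: Callable[[str], str]) -> List[TaskResult]:
        return [TaskResult(name, runner(model))
                for name, runner in self._tasks.items()]


def toy_malicious_content_task(model: Callable[[str], str]) -> float:
    # Tiny stand-in for a malicious-content-detection environment.
    samples = [("Click here to claim your prize and enter your password", "malicious"),
               ("Meeting moved to 3pm, see updated agenda attached", "benign")]
    correct = sum(1 for text, label in samples
                  if label in model(f"Classify as malicious or benign: {text}").lower())
    return correct / len(samples)


def dummy_model(prompt: str) -> str:
    # Stand-in for a real LLM call; flags prompts mentioning passwords.
    return "malicious" if "password" in prompt.lower() else "benign"


if __name__ == "__main__":
    harness = BenchmarkHarness()
    harness.register_task("malicious_content_detection", toy_malicious_content_task)
    for result in harness.evaluate(dummy_model):
        print(f"{result.task_name}: {result.score:.2f}")
```

In a design like this, swapping in a different LLM only requires providing another prompt-to-completion callable, which is one way a toolkit can keep model comparisons standardized and reproducible across tasks.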

Country of Origin
🇺🇸 United States


Page Count
14 pages

Category
Computer Science:
Computation and Language