Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

Published: August 5, 2025 | arXiv ID: 2508.05674v1

By: Minghao Shao, Nanda Rani, Kimberly Milner and more

Potential Business Impact:

Teaches computers to hack systems like a pro.

Recent advances in LLM agentic systems have improved the automation of offensive security tasks, particularly for Capture the Flag (CTF) challenges. We systematically investigate the key factors that drive agent success and provide a detailed recipe for building effective LLM-based offensive security agents. First, we present CTFJudge, a framework that uses an LLM as a judge to analyze agent trajectories and provide granular evaluation across CTF solving steps. Second, we propose a novel metric, the CTF Competency Index (CCI), which measures partial correctness and reveals how closely agent solutions align with human-crafted gold standards. Third, we examine how LLM hyperparameters, namely temperature, top-p, and maximum token length, influence agent performance and automated cybersecurity task planning. For rapid evaluation, we present CTFTiny, a curated benchmark of 50 representative CTF challenges across binary exploitation, web, reverse engineering, forensics, and cryptography. Our findings identify optimal multi-agent coordination settings and lay the groundwork for future LLM agent research in cybersecurity. We release CTFTiny as open source at https://github.com/NYU-LLM-CTF/CTFTiny, along with CTFJudge at https://github.com/NYU-LLM-CTF/CTFJudge.
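
As a rough illustration of the hyperparameter study mentioned in the abstract, the sketch below enumerates a grid over temperature, top-p, and maximum token length and picks the best configuration under a caller-supplied scoring function. The grid values, the function names `build_grid` and `best_config`, and the scoring interface are all assumptions made for illustration; the abstract does not state the values or the search procedure the authors used.

```python
from itertools import product

# Hypothetical grid values, chosen only for illustration.
# The abstract does not list the settings actually evaluated.
TEMPERATURES = [0.0, 0.5, 1.0]
TOP_PS = [0.5, 0.9, 1.0]
MAX_TOKENS = [1024, 4096]


def build_grid(temperatures, top_ps, max_tokens):
    """Return one config dict per combination of the three hyperparameters."""
    return [
        {"temperature": t, "top_p": p, "max_tokens": m}
        for t, p, m in product(temperatures, top_ps, max_tokens)
    ]


def best_config(grid, evaluate):
    """Return the config with the highest score.

    `evaluate` is a hypothetical callable that maps a config to a number,
    for example a solve rate measured on a benchmark run.
    """
    return max(grid, key=evaluate)


grid = build_grid(TEMPERATURES, TOP_PS, MAX_TOKENS)
```

In practice `evaluate` would run the agent on a benchmark such as CTFTiny with the given settings and return a score; here it is left to the caller so the sketch stays independent of any particular agent framework.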

CyberSleuth: Autonomous Blue-Team LLM Agent for Web Attack Forensics

Cryptography and Security

Helps find computer attackers and their tricks.

28 Aug 2025 1

88%

D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security

Artificial Intelligence

Helps computers hack systems better and faster.

15 Feb 2025 2

88%

Construction and Evaluation of LLM-based agents for Semi-Autonomous penetration testing

Cryptography and Security

Helps computers hack systems automatically and safely.

21 Feb 2025 0

Country of Origin

🇮🇳 🇺🇸 India, United States

Repos / Data Links

github.com

Page Count

14 pages
