Evaluating LLM Generated Detection Rules in Cybersecurity
By: Anna Bertiger, Bobby Filar, Aryan Luthra, and others
Potential Business Impact:
Tests whether AI can write good computer security rules.
LLMs are increasingly pervasive in the security environment, yet there are few measures of their effectiveness, which limits their trustworthiness and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for assessing LLM-generated cybersecurity rules. The benchmark uses a holdout-set methodology to measure the effectiveness of LLM-generated security rules against a corpus of human-written rules. It provides three key metrics inspired by the way experts evaluate security rules, offering a realistic, multifaceted assessment of an LLM-based security rule generator. We illustrate the methodology using rules written by Sublime Security's detection team and rules written by Sublime Security's Automated Detection Engineer (ADE), with a thorough analysis of ADE's skills presented in the results section.
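To make the holdout-set idea concrete, here is a minimal sketch of such an evaluation harness. The abstract does not name the three metrics or the rule format, so everything below is an illustrative assumption: the `Sample` and `Rule` types, the `split_holdout` helper, and the example metrics (detection rate and false-positive rate) stand in for whatever the paper's framework and Sublime Security's ADE actually use.

```python
# Sketch of a holdout-set comparison between LLM-generated and human-written
# detection rules. Types, helpers, and metrics are hypothetical placeholders,
# not the paper's actual framework.
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    """A message or event that rules are tested against."""
    content: str
    is_malicious: bool


@dataclass
class Rule:
    """A detection rule modeled as a predicate over samples."""
    name: str
    matches: Callable[[Sample], bool]


def split_holdout(rules: List[Rule], holdout_frac: float = 0.2, seed: int = 7):
    """Hold out a fraction of the human-written corpus for comparison."""
    rng = random.Random(seed)
    shuffled = rules[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_frac))
    return shuffled[:cut], shuffled[cut:]  # (reference set, holdout set)


def detection_rate(rules: List[Rule], samples: List[Sample]) -> float:
    """Fraction of malicious samples caught by at least one rule."""
    malicious = [s for s in samples if s.is_malicious]
    if not malicious:
        return 0.0
    hits = sum(any(r.matches(s) for r in rules) for s in malicious)
    return hits / len(malicious)


def false_positive_rate(rules: List[Rule], samples: List[Sample]) -> float:
    """Fraction of benign samples flagged by at least one rule."""
    benign = [s for s in samples if not s.is_malicious]
    if not benign:
        return 0.0
    flags = sum(any(r.matches(s) for r in rules) for s in benign)
    return flags / len(benign)


def compare_to_holdout(llm_rules, holdout_rules, samples):
    """Report the same (hypothetical) metrics for LLM and human rule sets."""
    return {
        "llm_detection_rate": detection_rate(llm_rules, samples),
        "holdout_detection_rate": detection_rate(holdout_rules, samples),
        "llm_false_positive_rate": false_positive_rate(llm_rules, samples),
        "holdout_false_positive_rate": false_positive_rate(holdout_rules, samples),
    }


if __name__ == "__main__":
    # Toy corpus purely for demonstration.
    samples = [
        Sample("urgent wire transfer request", is_malicious=True),
        Sample("quarterly newsletter", is_malicious=False),
    ]
    human_rules = [Rule("wire_fraud", lambda s: "wire transfer" in s.content)]
    llm_rules = [Rule("llm_wire_fraud", lambda s: "wire" in s.content)]
    _, holdout = split_holdout(human_rules, holdout_frac=1.0)
    print(compare_to_holdout(llm_rules, holdout, samples))
```

The point of the holdout design is that the human-written rules serve as the yardstick: the LLM generator is scored on the same samples and the same metrics as the held-out expert rules, so its numbers can be read relative to expert performance rather than in isolation.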
Similar Papers
CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning
Cryptography and Security
Helps computers fight cyberattacks better.
A Practical Guide for Evaluating LLMs and LLM-Reliant Systems
Artificial Intelligence
Tests AI language tools for real-world use.
A Comprehensive Study of LLM Secure Code Generation
Cryptography and Security
Finds flaws in AI-written code.