SIRAJ: Diverse and Efficient Red-Teaming for LLM Agents via Distilled Structured Reasoning
By: Kaiwen Zhou, Ahmed Elgohary, A S M Iftekhar, and more
Potential Business Impact:
Finds hidden problems in smart computer programs.
The ability of LLM agents to plan and invoke tools exposes them to new safety risks, making a comprehensive red-teaming system crucial for discovering vulnerabilities and ensuring their safe deployment. We present SIRAJ: a generic red-teaming framework for arbitrary black-box LLM agents. We employ a dynamic two-step process that starts with an agent definition and generates diverse seed test cases covering various risk outcomes, tool-use trajectories, and risk sources. It then iteratively constructs and refines model-based adversarial attacks based on the execution trajectories of previous attempts. To reduce red-teaming cost, we present a model distillation approach that leverages structured forms of a teacher model's reasoning to train smaller models that are equally effective. Across diverse evaluation agent settings, our seed test case generation approach yields a 2-2.5x boost in the coverage of risk outcomes and tool-calling trajectories. Our distilled 8B red-teamer model improves attack success rate by 100%, surpassing the 671B Deepseek-R1 model. Our ablations and analyses validate the effectiveness of the iterative framework, structured reasoning, and the generalization of our red-teamer models.
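To make the two-step loop concrete, here is a minimal Python sketch of the control flow the abstract describes: seed generation, execution against the black-box agent, and trajectory-driven refinement. Every function, class, and field name below is a hypothetical placeholder rather than SIRAJ's actual API, and the LLM calls are stubbed with toy logic so the flow runs standalone.

```python
"""Minimal sketch of an iterative agent red-teaming loop in the spirit of
SIRAJ's two-step process. All names are illustrative placeholders; the LLM
calls are stubbed so the control flow can be run end to end."""

import random
from dataclasses import dataclass


@dataclass
class TestCase:
    risk_outcome: str     # e.g. "unsafe file deletion"
    tool_trajectory: list # tool-call sequence the seed targets
    risk_source: str      # e.g. "attacker-controlled web page"
    prompt: str           # adversarial input for this attempt


def generate_seed_cases(agent_definition: str) -> list:
    """Step 1 (stub): a red-teamer LLM would enumerate diverse seeds
    covering risk outcomes, tool trajectories, and risk sources."""
    return [
        TestCase(
            risk_outcome="unsafe file deletion",
            tool_trajectory=["search", "delete_file"],
            risk_source="attacker-controlled web page",
            prompt="Summarize this page, then clean up old files.",
        ),
    ]


def run_agent(case: TestCase):
    """Execute the black-box target agent (stubbed): return the observed
    tool-call trajectory and whether the risky outcome was triggered."""
    attack_succeeded = random.random() < 0.3
    return case.tool_trajectory, attack_succeeded


def refine_attack(case: TestCase, trajectory: list) -> TestCase:
    """Step 2 (stub): the red-teamer model rewrites the attack using the
    execution trajectory of the previous failed attempt."""
    case.prompt += " (rephrased against the observed trajectory)"
    return case


def red_team(agent_definition: str, max_rounds: int = 3) -> list:
    """Iterate each seed up to max_rounds, keeping successful attacks."""
    successes = []
    for case in generate_seed_cases(agent_definition):
        for _ in range(max_rounds):
            trajectory, succeeded = run_agent(case)
            if succeeded:
                successes.append(case)
                break
            case = refine_attack(case, trajectory)
    return successes


if __name__ == "__main__":
    print(red_team("an email-and-files assistant agent"))
```

In the full framework, the refinement step would be backed by the distilled 8B red-teamer model trained on the teacher's structured reasoning, rather than the toy stub shown here.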
Similar Papers
Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System
Cryptography and Security
Finds hidden dangers in AI programs.
Automatic LLM Red Teaming
Machine Learning (CS)
Trains AI to find weaknesses in other AI.
RedTeamLLM: an Agentic AI framework for offensive security
Cryptography and Security
AI finds computer weaknesses before hackers do.