Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
By: Si Chen, Xiao Yu, Ninareh Mehrabi, and more
Potential Business Impact:
Learns from past attempts how to trick AI.
The exploitation of large language models (LLMs) for malicious purposes poses significant security risks as these models become more powerful and widespread. While most existing red-teaming frameworks focus on single-turn attacks, real-world adversaries typically operate in multi-turn scenarios, iteratively probing for vulnerabilities and adapting their prompts based on the target model's responses. In this paper, we propose \AlgName, a novel multi-turn red-teaming agent that emulates sophisticated human attackers through two complementary learning dimensions: global tactic-wise learning, which accumulates knowledge over time and generalizes to new attack goals, and local prompt-wise learning, which refines implementations for a specific goal when initial attempts fail. Unlike previous multi-turn approaches that rely on fixed strategy sets, \AlgName enables the agent to identify new jailbreak tactics, develop a goal-based tactic-selection framework, and refine prompt formulations for the selected tactics. Empirical evaluations on JailbreakBench demonstrate the framework's superior performance: it achieves over 90% attack success rates against GPT-3.5-Turbo and Llama-3.1-70B within five conversation turns, outperforming state-of-the-art baselines. These results highlight the effectiveness of dynamic learning in identifying and exploiting model vulnerabilities in realistic multi-turn scenarios.
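To make the two learning levels concrete, below is a minimal Python sketch of a dual-level attack loop under stated assumptions: a global tactic library whose per-tactic success statistics carry over across goals, and a local refinement step that rewrites the prompt after each refusal. Every name here (Tactic, TACTIC_LIBRARY, select_tactic, refine_prompt, query_target, is_jailbroken) is an illustrative stand-in, not the paper's actual API; a real agent would back the stubs with attacker, target, and judge LLM calls.

```python
import random
from dataclasses import dataclass


@dataclass
class Tactic:
    """A jailbreak tactic plus cross-goal success statistics (global learning)."""
    name: str
    description: str
    successes: int = 0
    attempts: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


# Global tactic library: persists and accumulates statistics across attack goals.
TACTIC_LIBRARY = [
    Tactic("role_play", "Frame the request as fiction or a persona."),
    Tactic("gradual_escalation", "Start benign, escalate over turns."),
    Tactic("hypothetical", "Pose the goal as a purely hypothetical question."),
]


def select_tactic(library: list[Tactic]) -> Tactic:
    """Goal-based tactic selection, simplified here to epsilon-greedy over
    observed success rates (a stand-in for the paper's learned selector)."""
    if random.random() < 0.2:
        return random.choice(library)
    return max(library, key=lambda t: t.success_rate)


def query_target(prompt: str) -> str:
    """Placeholder for the target LLM; swap in a real API call."""
    return "I can't help with that."  # stubbed refusal


def is_jailbroken(response: str) -> bool:
    """Placeholder judge; a real system would use a classifier or LLM judge."""
    return "I can't" not in response


def refine_prompt(prompt: str, response: str, tactic: Tactic) -> str:
    """Local prompt-wise learning: rewrite the prompt using feedback from the
    target's last response. A real agent would call an attacker LLM here."""
    return f"{prompt} (reframed per '{tactic.name}' after refusal: {response[:40]})"


def attack(goal: str, max_turns: int = 5) -> bool:
    """Run one multi-turn attack on a single goal."""
    tactic = select_tactic(TACTIC_LIBRARY)
    prompt = f"[{tactic.name}] {goal}"
    for _ in range(max_turns):
        tactic.attempts += 1
        response = query_target(prompt)
        if is_jailbroken(response):
            tactic.successes += 1  # global learning: credit the tactic
            return True
        prompt = refine_prompt(prompt, response, tactic)  # local learning
    return False


if __name__ == "__main__":
    print(attack("example benign test goal"))
```

The split mirrors the abstract's two dimensions: tactic statistics outlive any one goal (strategize globally), while prompt refinement uses only the current conversation's feedback (adapt locally).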
Similar Papers
X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
Cryptography and Security
Finds ways AI can be tricked in conversations.
RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning
Cryptography and Security
Finds ways to trick AI into making bad code.
Multi-lingual Multi-turn Automated Red Teaming for LLMs
Computation and Language
Finds ways for AI to say bad things.