RedTWIZ: Diverse LLM Red Teaming via Adaptive Attack Planning
By: Artur Horal, Daniel Pina, Henrique Paz, and more
Potential Business Impact:
Finds ways to trick AI coding assistants into producing unsafe code.
This paper presents the vision, scientific contributions, and technical details of RedTWIZ: an adaptive and diverse multi-turn red teaming framework that audits the robustness of Large Language Models (LLMs) in AI-assisted software development. Our work is driven by three major research streams: (1) robust and systematic assessment of LLM conversational jailbreaks; (2) a diverse generative multi-turn attack suite, supporting compositional, realistic, and goal-oriented jailbreak conversational strategies; and (3) a hierarchical attack planner, which adaptively plans, serializes, and triggers attacks tailored to a specific LLM's vulnerabilities. Together, these contributions form a unified framework that combines assessment, attack generation, and strategic planning to comprehensively evaluate and expose weaknesses in LLM robustness. Extensive evaluation is conducted to systematically assess and analyze the performance of the overall system and of each component. Experimental results demonstrate that our multi-turn adversarial attack strategies can successfully lead state-of-the-art LLMs to produce unsafe generations, highlighting the pressing need for further research into enhancing LLM robustness.
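The abstract gives no implementation details, but the adaptive planner it describes can be pictured as a loop that picks an attack strategy, runs a multi-turn conversation against the target model, and shifts toward whatever has worked so far. The sketch below is purely illustrative and not the authors' system: the strategy names, the target_model and judge callables, and the epsilon-greedy selection rule are all assumptions introduced here for the example.

    import random
    from collections import defaultdict

    # Hypothetical strategy labels; RedTWIZ's actual attacks are generative
    # multi-turn conversational strategies, so these are only stand-ins.
    STRATEGIES = ["role_play", "goal_decomposition", "context_drift"]

    def run_conversation(strategy, target_model, judge, goal, max_turns=5):
        """Run one multi-turn attack; return True if the judge flags an
        unsafe generation. target_model and judge are assumed callables."""
        history = []
        for turn in range(max_turns):
            prompt = f"[{strategy}] turn {turn}: steer toward '{goal}'"
            reply = target_model(prompt, history)
            history.append((prompt, reply))
            if judge(reply):  # judge returns True on unsafe output
                return True
        return False

    def adaptive_planner(target_model, judge, goals, episodes=20, epsilon=0.2):
        """Epsilon-greedy planning: mostly exploit the strategy with the best
        observed success rate against this target, occasionally explore."""
        wins, tries = defaultdict(int), defaultdict(int)
        for _ in range(episodes):
            if random.random() < epsilon or not tries:
                strategy = random.choice(STRATEGIES)
            else:
                strategy = max(
                    STRATEGIES,
                    key=lambda s: wins[s] / tries[s] if tries[s] else 0.0,
                )
            goal = random.choice(goals)
            success = run_conversation(strategy, target_model, judge, goal)
            tries[strategy] += 1
            wins[strategy] += int(success)
        return {s: (wins[s], tries[s]) for s in STRATEGIES}

In this toy formulation, per-strategy success counts play the role of the vulnerability profile the paper attributes to its hierarchical planner: over episodes, attack selection concentrates on the strategies a given LLM is weakest against.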
Similar Papers
Strategize Globally, Adapt Locally: A Multi-Turn Red Teaming Agent with Dual-Level Learning
Artificial Intelligence
Finds ways to trick AI by learning from past attacks.
RedTeamLLM: an Agentic AI framework for offensive security
Cryptography and Security
AI finds computer weaknesses before hackers do.
Multi-lingual Multi-turn Automated Red Teaming for LLMs
Computation and Language
Finds ways for AI to say bad things.