AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming
By: Muxi Diao, Yutao Mou, Keqing He, and more
Potential Business Impact:
Automatically finds hidden safety flaws in AI language models.
The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
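To make the two-stage pipeline in the abstract concrete, here is a minimal, hypothetical sketch of a persona-guided generation loop with a verifier and a reflection step. Every function name, prompt string, and threshold below is an illustrative assumption rather than the authors' implementation, and the LLM calls are stubbed with placeholders so the sketch runs standalone.

```python
# Hypothetical sketch of an AutoRed-style loop, assuming:
#   Stage 1: generate a free-form adversarial prompt from a persona (no seed instruction)
#   Verifier: score prompt harmfulness without querying the target model
#   Stage 2: reflect on low-scoring prompts and refine them iteratively
# All names and scores here are placeholders, not the paper's actual code.

from dataclasses import dataclass


@dataclass
class Candidate:
    persona: str
    prompt: str
    harm_score: float = 0.0


def generate_from_persona(persona: str) -> str:
    """Stage 1 (assumed): a generator LLM writes an adversarial prompt
    conditioned only on a persona description."""
    return f"[adversarial prompt written in the voice of: {persona}]"  # stub


def verify_harmfulness(prompt: str) -> float:
    """Assumed verifier: returns a harmfulness score in [0, 1] without
    calling the target model (e.g. a trained classifier or judge LLM)."""
    return 0.5  # placeholder score


def reflect_and_refine(prompt: str, score: float) -> str:
    """Stage 2 (assumed): rewrite a low-quality prompt using verifier feedback."""
    return prompt + " [refined after reflection]"


def autored_style_loop(personas, threshold=0.8, max_rounds=3):
    """Generate -> verify -> reflect until the score clears the threshold
    or the round budget runs out."""
    accepted = []
    for persona in personas:
        cand = Candidate(persona, generate_from_persona(persona))
        for _ in range(max_rounds):
            cand.harm_score = verify_harmfulness(cand.prompt)
            if cand.harm_score >= threshold:
                break
            cand.prompt = reflect_and_refine(cand.prompt, cand.harm_score)
        accepted.append(cand)
    return accepted


if __name__ == "__main__":
    for c in autored_style_loop(["a disgruntled sysadmin", "a curious teenager"]):
        print(c.persona, "->", c.prompt, f"(score={c.harm_score:.2f})")
```

The point of the verifier in this sketch is efficiency: low-quality prompts are filtered and refined before any query reaches the target model, which is the role the abstract attributes to AutoRed's verifier.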
Similar Papers
AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
CV and Pattern Recognition
Finds ways to trick image-makers into making bad pictures.
Automatic LLM Red Teaming
Machine Learning (CS)
Trains AI to find weaknesses in other AI.
RedTeamLLM: an Agentic AI framework for offensive security
Cryptography and Security
AI finds computer weaknesses before hackers do.