Jailbreak-R1: Exploring the Jailbreak Capabilities of LLMs via Reinforcement Learning
By: Weiyang Guo, Zesheng Shi, Zhuo Li, and more
Potential Business Impact:
Trains an AI to find weaknesses in other AI so they can be made safer.
As large language models (LLMs) grow in power and influence, ensuring their safety and preventing harmful outputs becomes critical. Automated red teaming serves as a tool to detect security vulnerabilities in LLMs without manual labor. However, most existing methods struggle to balance the effectiveness and diversity of the attack prompts they generate. To address this challenge, we propose Jailbreak-R1, a novel automated red-teaming training framework that uses reinforcement learning to explore and generate more effective attack prompts while preserving their diversity. Specifically, it consists of three training stages: (1) Cold Start: the red-team model undergoes supervised fine-tuning on a jailbreak dataset obtained through imitation learning. (2) Warm-up Exploration: the model is trained on jailbreak instruction following and exploration, using diversity and consistency as reward signals. (3) Enhanced Jailbreak: progressive jailbreak rewards are introduced to gradually strengthen the red-team model's jailbreak performance. Extensive experiments on a variety of LLMs show that Jailbreak-R1 balances the diversity and effectiveness of jailbreak prompts more effectively than existing methods. Our work significantly improves the efficiency of red-team exploration and provides a new perspective on automated red teaming.
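To make the staged reward design concrete, below is a minimal Python sketch of how a composite reward might blend a diversity term, a consistency term, and a progressively weighted jailbreak score. The function names, the 0.5/0.5 weighting, and the linear ramp schedule are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: combines embedding-based diversity, a judge's
# consistency score, and a jailbreak score whose weight ramps up over
# training. Weights and schedule are assumptions, not from the paper.
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def diversity_reward(candidate: List[float], batch: List[List[float]]) -> float:
    """Reward attack prompts that differ from the rest of the batch."""
    if not batch:
        return 1.0
    max_sim = max(cosine_similarity(candidate, other) for other in batch)
    return 1.0 - max_sim

def composite_reward(candidate_emb: List[float],
                     batch_embs: List[List[float]],
                     consistency: float,
                     jailbreak_score: float,
                     step: int,
                     warmup_steps: int = 1000) -> float:
    """Blend diversity/consistency with a progressively weighted jailbreak term.

    consistency     -- judge score in [0, 1] for instruction following
    jailbreak_score -- judge score in [0, 1] for attack success
    step            -- current training step; the jailbreak weight ramps
                       linearly from 0 to 1 over `warmup_steps` (assumed)
    """
    ramp = min(1.0, step / warmup_steps)  # progressive jailbreak weight
    base = 0.5 * diversity_reward(candidate_emb, batch_embs) + 0.5 * consistency
    return (1.0 - ramp) * base + ramp * jailbreak_score

# Example: early in training the reward is dominated by diversity/consistency;
# by step >= warmup_steps it is dominated by the jailbreak score.
emb = [0.1, 0.9, 0.2]
others = [[0.2, 0.8, 0.1], [0.9, 0.1, 0.3]]
print(composite_reward(emb, others, consistency=0.8,
                       jailbreak_score=0.6, step=250))
```

Annealing the weight from the diversity and consistency terms toward the jailbreak score mirrors the curriculum in stages (2) and (3): the red-team model first learns to produce varied, well-formed attack prompts, and only then is pushed toward attack success.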
Similar Papers
Red Teaming the Mind of the Machine: A Systematic Evaluation of Prompt Injection and Jailbreak Vulnerabilities in LLMs
Cryptography and Security
Tests how easily AI can be tricked into bad things.
RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models
Artificial Intelligence
Teaches computers to trick other computers into saying bad things.
Automatic LLM Red Teaming
Machine Learning (CS)
Trains AI to find weaknesses in other AI.