MIRAGE: Multimodal Immersive Reasoning and Guided Exploration for Red-Team Jailbreak Attacks
By: Wenhao You, Bryan Hooi, Yiwei Wang, and more
Potential Business Impact:
Tricks AI into producing harmful content by wrapping requests in stories.
While safety mechanisms have significantly progressed in filtering harmful text inputs, Multimodal Large Language Models (MLLMs) remain vulnerable to multimodal jailbreaks that exploit their cross-modal reasoning capabilities. We present MIRAGE, a novel multimodal jailbreak framework that exploits narrative-driven context and role immersion to circumvent safety mechanisms in MLLMs. By systematically decomposing a toxic query into environment, role, and action triplets, MIRAGE constructs a multi-turn visual storytelling sequence of images and text using Stable Diffusion, guiding the target model through an engaging detective narrative. This process progressively lowers the model's defences and subtly steers its reasoning through structured contextual cues, ultimately eliciting harmful responses. In extensive experiments on the selected datasets with six mainstream MLLMs, MIRAGE achieves state-of-the-art performance, improving attack success rates by up to 17.5% over the best baselines. Moreover, we demonstrate that role immersion and structured semantic reconstruction can activate inherent model biases, facilitating the model's spontaneous violation of ethical safeguards. These results highlight critical weaknesses in current multimodal safety mechanisms and underscore the urgent need for more robust defences against cross-modal threats.
Similar Papers
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models
Cryptography and Security
Shows how combined image-and-text attacks can make multimodal AI models unsafe.
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations
Cryptography and Security
Tricks AI into producing harmful content using simple changes to images and audio.
Implicit Jailbreak Attacks via Cross-Modal Information Concealment on Vision-Language Models
Machine Learning (CS)
Makes AI models follow harmful instructions hidden inside images.