Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling
By: Deyue Zhang, Dongdong Yang, Junjie Mu, and more
Potential Business Impact:
Makes AI models say bad things using pictures.
Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs. Our method decomposes malicious queries into visually innocuous storytelling elements using an auxiliary LLM, generates corresponding image sequences through diffusion models, and exploits the models' reliance on narrative coherence to elicit harmful outputs. Extensive experiments on harmful textual queries from established safety benchmarks show that our approach achieves an average attack success rate of 83.5%, surpassing the prior state-of-the-art by 46%. Compared with existing visual jailbreak methods, our sequential narrative strategy demonstrates superior effectiveness across diverse categories of harmful content. We further analyze attack patterns, uncover key vulnerability factors in multimodal safety mechanisms, and evaluate the limitations of current defense strategies against narrative-driven attacks, revealing significant gaps in existing protections.
Similar Papers
Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses
Cryptography and Security
Finds ways to trick smart AI with pictures.
VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack
Computer Vision and Pattern Recognition
Tricks AI into saying bad things using picture sequences.
Beyond Text: Multimodal Jailbreaking of Vision-Language and Audio Models through Perceptually Simple Transformations
Cryptography and Security
Tricks AI into showing bad stuff using pictures.