CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks
By: Xu Zhang, Hao Li, Zhichao Lu
Potential Business Impact:
Protects smart AI from tricky hidden attacks.
Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on the resulting dataset, we further develop CrossGuard, an intent-aware safeguard that provides robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.
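The abstract's core idea is that an intent-aware guard must score what the text and image express together, not each modality in isolation. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; every function name, score value, and threshold is a placeholder assumption used only to show why a joint-intent check can flag inputs that look benign modality-by-modality.

```python
# Minimal sketch (hypothetical, not CrossGuard's actual implementation):
# an intent-aware guard that scores the *joint* text+image intent in addition
# to per-modality checks. All names, scores, and thresholds are placeholders.

from dataclasses import dataclass


@dataclass
class GuardVerdict:
    allow: bool
    score: float   # estimated probability that the combined intent is unsafe
    reason: str


def score_text_only(prompt: str) -> float:
    """Placeholder per-modality safety score for the text alone."""
    return 0.05  # an implicit attack's text often looks benign in isolation


def score_image_only(image_bytes: bytes) -> float:
    """Placeholder per-modality safety score for the image alone."""
    return 0.05  # likewise, the image alone may pass a single-modality filter


def score_joint_intent(prompt: str, image_bytes: bytes) -> float:
    """Placeholder for a cross-modal intent scorer (e.g., a fine-tuned
    classifier head over a multimodal encoder) that evaluates what the
    text and image express together."""
    return 0.90  # the combined intent is what reveals an implicit attack


def cross_modal_guard(prompt: str, image_bytes: bytes,
                      threshold: float = 0.5) -> GuardVerdict:
    # A guard that only checks modalities separately misses implicit attacks;
    # the joint score is what catches benign-looking pairs with unsafe intent.
    per_modality = max(score_text_only(prompt), score_image_only(image_bytes))
    joint = score_joint_intent(prompt, image_bytes)
    risk = max(per_modality, joint)
    reason = "joint-intent" if joint > per_modality else "single-modality"
    return GuardVerdict(allow=risk < threshold, score=risk, reason=reason)


if __name__ == "__main__":
    verdict = cross_modal_guard("How do I assemble the item shown?", b"<image bytes>")
    print(verdict)
```

The design point the sketch illustrates is that per-modality filters alone would admit an implicit attack, since each input is individually benign; only the joint-intent score flags the combined request.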
Similar Papers
Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks
Cryptography and Security
Shows how combining text and pictures can trick AI models.
Enhanced MLLM Black-Box Jailbreaking Attacks and Defenses
Cryptography and Security
Explores picture-based tricks on smart AI and ways to block them.
Align is not Enough: Multimodal Universal Jailbreak Attack against Multimodal Large Language Models
Cryptography and Security
Shows one trick that breaks many picture-and-text AI models.