Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models: A Unified and Accurate Approach
By: Shuang Liang, Zhihao Xu, Jialing Tao, and more
Potential Business Impact:
Helps AI spot when someone is trying to trick it with harmful questions or images.
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. Although recent detection works have shifted to internal representations due to their rich cross-modal information, most methods rely on heuristic rules rather than principled objectives, resulting in suboptimal performance. To address these limitations, we propose Learning to Detect (LoD), a novel unsupervised framework that formulates jailbreak detection as anomaly detection. LoD introduces two key components: Multi-modal Safety Concept Activation Vectors (MSCAV), which capture layer-wise safety-related representations across modalities, and the Safety Pattern Auto-Encoder, which models the distribution of MSCAV derived from safe inputs and detects anomalies via reconstruction errors. By training the auto-encoder (AE) solely on safe samples without attack labels, LoD naturally identifies jailbreak inputs as distributional anomalies, enabling accurate and unified detection of jailbreak attacks. Comprehensive experiments on three different LVLMs and five benchmarks demonstrate that LoD achieves state-of-the-art performance, with an average AUROC of 0.9951 and an improvement of up to 38.89% in the minimum AUROC over the strongest baselines.
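To make the detection pipeline concrete, the following is a minimal sketch of the anomaly-detection idea described in the abstract: an auto-encoder is fit only to safety-related feature vectors from safe inputs, and inputs with large reconstruction error are flagged as likely jailbreaks. This is not the authors' implementation; MSCAV extraction from the LVLM's layers is omitted, and the names and parameters used here (SafetyPatternAE, mscav_dim, the 99th-percentile threshold) are illustrative assumptions.

```python
# Hedged sketch of auto-encoder-based jailbreak detection, assuming MSCAV
# features per input have already been extracted and concatenated into a
# fixed-length vector. All names and hyperparameters are hypothetical.
import torch
import torch.nn as nn

class SafetyPatternAE(nn.Module):
    """Auto-encoder trained only on MSCAV vectors from safe inputs."""
    def __init__(self, mscav_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(mscav_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim // 2, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, mscav_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def train_on_safe(model: SafetyPatternAE, safe_mscav: torch.Tensor,
                  epochs: int = 100, lr: float = 1e-3) -> SafetyPatternAE:
    """Fit the AE to the distribution of safe MSCAV vectors (no attack labels)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(safe_mscav), safe_mscav)
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def anomaly_score(model: SafetyPatternAE, mscav: torch.Tensor) -> torch.Tensor:
    """Per-sample reconstruction error; high values indicate distributional anomalies."""
    recon = model(mscav)
    return ((recon - mscav) ** 2).mean(dim=-1)


if __name__ == "__main__":
    dim = 128                      # hypothetical concatenated MSCAV dimension
    safe = torch.randn(256, dim)   # stand-in for MSCAV of known-safe prompts
    test = torch.randn(8, dim)     # stand-in for unseen (possibly attacked) inputs
    ae = train_on_safe(SafetyPatternAE(dim), safe)
    threshold = anomaly_score(ae, safe).quantile(0.99)  # e.g. 99th percentile of safe errors
    flags = anomaly_score(ae, test) > threshold
    print(flags)                   # True -> input treated as a likely jailbreak
```

Because the model never sees attack labels during training, any attack family whose MSCAV deviates from the safe-input distribution should, under this scheme, yield an elevated reconstruction error, which is what lets a single detector cover unknown jailbreak types.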
Similar Papers
Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
CV and Pattern Recognition
Stops AI from being tricked into bad things.
Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring
Cryptography and Security
Stops AI from being tricked by bad questions.
JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model
Cryptography and Security
Stops AI from making bad or unsafe stuff.