Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

Published: January 15, 2026 | arXiv ID: 2601.10313v1

By: Peng-Fei Zhang, Zi Huang

Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.
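The abstract gives no equations, so the following is only a rough illustration of two ideas it mentions: a single perturbation shared by every sample, and an update that mixes historical gradients (momentum) with an estimated future gradient (a look-ahead step). It is not HRA. It swaps in a toy logistic classifier for a VLP model and borrows the update rule from Nesterov-style iterative FGSM. Every name and hyperparameter here (`universal_perturbation`, `eps`, `alpha`, `mu`, the toy data) is an assumption made for the sketch.

```python
import math

def logistic_loss(w, xs, ys, delta):
    """Mean logistic loss of a linear scorer when `delta` is added to every sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        score = sum(wj * (xj + dj) for wj, xj, dj in zip(w, x, delta))
        total += math.log1p(math.exp(-y * score))
    return total / len(xs)

def loss_grad(w, xs, ys, delta):
    """Gradient of the mean logistic loss with respect to the shared perturbation."""
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        score = sum(wj * (xj + dj) for wj, xj, dj in zip(w, x, delta))
        coeff = -y / (1.0 + math.exp(y * score))  # derivative of log(1 + exp(-y*s)) in s
        for j, wj in enumerate(w):
            grad[j] += coeff * wj
    return [g / len(xs) for g in grad]

def sign(v):
    return (v > 0) - (v < 0)

def universal_perturbation(w, xs, ys, eps=0.5, alpha=0.05, mu=0.9, steps=40):
    """Hypothetical helper: one perturbation for all samples, kept inside an L-infinity ball."""
    delta = [0.0] * len(w)
    momentum = [0.0] * len(w)
    for _ in range(steps):
        # Estimated future position: step ahead along the accumulated (historical) direction.
        lookahead = [d + alpha * mu * m for d, m in zip(delta, momentum)]
        g = loss_grad(w, xs, ys, lookahead)
        norm = sum(abs(v) for v in g) or 1.0
        # Historical gradients enter through the momentum buffer.
        momentum = [mu * m + gj / norm for m, gj in zip(momentum, g)]
        # Ascent step on the loss, then projection back to [-eps, eps].
        delta = [max(-eps, min(eps, d + alpha * sign(m))) for d, m in zip(delta, momentum)]
    return delta

# Toy data (assumed): four 3-dimensional samples, all correctly classified by w.
w = [1.0, -2.0, 0.5]
xs = [[1.0, 0.0, 1.0], [0.5, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.5, -1.0]]
ys = [1, 1, -1, -1]

delta = universal_perturbation(w, xs, ys)
clean_loss = logistic_loss(w, xs, ys, [0.0] * len(w))
adv_loss = logistic_loss(w, xs, ys, delta)
print(delta, clean_loss, adv_loss)
```

Because the same `delta` is applied to every sample, it cannot hurt all of them equally; the update settles on the direction that raises the average loss, which is the trade-off any universal perturbation faces.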

When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

CV and Pattern Recognition

Makes robots easily fooled by fake pictures.

26 Nov 2025

HV-Attack: Hierarchical Visual Attack for Multimodal Retrieval Augmented Generation

CV and Pattern Recognition

Tricks AI into giving wrong answers with hidden image changes.

19 Nov 2025

When Alignment Fails: Multimodal Adversarial Attacks on Vision-Language-Action Models

CV and Pattern Recognition

Shows how adversarial inputs can mislead robots that follow visual and language commands.

20 Nov 2025
