Seeing the Threat: Vulnerabilities in Vision-Language Models to Adversarial Attack
By: Juan Ren, Mark Dras, Usman Naseem
Potential Business Impact:
Makes AI safer from bad instructions.
Large Vision-Language Models (LVLMs) have shown remarkable capabilities across a wide range of multimodal tasks. However, their integration of visual inputs introduces expanded attack surfaces, thereby exposing them to novel security vulnerabilities. In this work, we conduct a systematic representational analysis to uncover why conventional adversarial attacks can circumvent the safety mechanisms embedded in LVLMs. We further propose a novel two-stage evaluation framework for adversarial attacks on LVLMs. The first stage differentiates among instruction non-compliance, outright refusal, and successful adversarial exploitation. The second stage quantifies the degree to which the model's output fulfills the harmful intent of the adversarial prompt, while categorizing refusal behavior into direct refusals, soft refusals, and partial refusals that remain inadvertently helpful. Finally, we introduce a normative schema that defines idealized model behavior when confronted with harmful prompts, offering a principled target for safety alignment in multimodal systems.
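To make the two-stage structure concrete, the sketch below shows one way the evaluation pipeline described in the abstract could be organized in code. It is a minimal illustration only: the class names, the judge callables (`classify`, `score_harm`, `classify_refusal`), and the return types are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of the two-stage evaluation framework from the abstract.
# All names, judge callables, and score ranges here are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class Stage1Outcome(Enum):
    NON_COMPLIANCE = auto()   # model ignores or misunderstands the instruction
    REFUSAL = auto()          # model declines to answer
    EXPLOITATION = auto()     # the adversarial prompt succeeds


class RefusalType(Enum):
    DIRECT = auto()           # explicit refusal with no assistance
    SOFT = auto()             # hedged refusal, still no harmful content
    PARTIAL = auto()          # refusal that remains inadvertently helpful


@dataclass
class EvaluationResult:
    stage1: Stage1Outcome
    harm_fulfillment: Optional[float] = None   # 0.0-1.0, only for exploitation
    refusal_type: Optional[RefusalType] = None  # only for refusals


def evaluate(response: str,
             classify: Callable[[str], Stage1Outcome],
             score_harm: Callable[[str], float],
             classify_refusal: Callable[[str], RefusalType]) -> EvaluationResult:
    """Stage 1 labels the response; Stage 2 quantifies or categorizes it."""
    outcome = classify(response)  # Stage 1: non-compliance / refusal / exploitation
    if outcome is Stage1Outcome.EXPLOITATION:
        # Stage 2a: how fully does the output satisfy the harmful intent?
        return EvaluationResult(outcome, harm_fulfillment=score_harm(response))
    if outcome is Stage1Outcome.REFUSAL:
        # Stage 2b: what kind of refusal is it?
        return EvaluationResult(outcome, refusal_type=classify_refusal(response))
    return EvaluationResult(outcome)
```

In practice the three judge callables would be supplied by whatever scoring method the evaluation uses (human annotation or an automated judge); the sketch only fixes the control flow implied by the abstract.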
Similar Papers
Transferable Adversarial Attacks on Black-Box Vision-Language Models
CV and Pattern Recognition
Makes AI misinterpret pictures to trick it.
Attention! Your Vision Language Model Could Be Maliciously Manipulated
CV and Pattern Recognition
Makes AI see and follow bad instructions.
A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications
Computers and Society
Finds weak spots in AI that sees and reads.