SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs
By: Juan Ren, Mark Dras, Usman Naseem
Potential Business Impact:
Helps stop AI systems from being tricked by harmful instructions hidden in ordinary-looking prompts.
Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.
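To make the pipeline concrete, here is a minimal sketch of what a SHIELD-style preprocessing wrapper could look like. All names here (the `shield_preprocess` function, the `Verdict` type, the example categories, and the guidance templates) are illustrative assumptions, not the paper's actual classifier, taxonomy, or prompts; the sketch only shows the flow described in the abstract: classify the input into a fine-grained category, then Block, Reframe (by composing a category-specific safety prompt), or Forward it to a frozen LVLM.

```python
# Hypothetical sketch of classifier-guided prompting; names and templates are
# assumptions for illustration, not the authors' implementation.
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    BLOCK = "block"        # refuse outright
    REFRAME = "reframe"    # answer the safe intent, ignoring the harmful framing
    FORWARD = "forward"    # pass the input through unchanged


@dataclass
class Verdict:
    category: str          # fine-grained safety category, e.g. "privacy"
    action: Action


# Category-specific guidance composed into the prompt (illustrative templates).
GUIDANCE = {
    "privacy": "Do not reveal or infer personal data; answer only in general terms.",
    "weapons": "Refuse any request that enables building or acquiring weapons.",
}


def shield_preprocess(
    text: str,
    image,                                        # raw image, forwarded to the LVLM
    classify: Callable[[str, object], Verdict],   # fine-grained safety classifier
    lvlm_generate: Callable[[str, object], str],  # frozen LVLM; no retraining
) -> str:
    """Classify the multimodal input, then block, reframe, or forward it."""
    verdict = classify(text, image)

    if verdict.action is Action.BLOCK:
        return "I can't help with that request."

    if verdict.action is Action.REFRAME:
        guidance = GUIDANCE.get(
            verdict.category, "Answer only the safe part of the request."
        )
        safety_prompt = (
            f"[Safety guidance: category={verdict.category}] {guidance}\n"
            "If the request conflicts with this guidance, refuse that part politely.\n\n"
        )
        return lvlm_generate(safety_prompt + text, image)

    # FORWARD: benign input, send it to the model untouched.
    return lvlm_generate(text, image)
```

Because the wrapper only needs black-box access to the model's generation call and adds at most one classifier pass plus a prepended prompt, it is consistent with the abstract's claims of being plug-and-play, model-agnostic, and low overhead.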
Similar Papers
A Call to Action for a Secure-by-Design Generative AI Paradigm
Cryptography and Security
Protects smart programs from being tricked.
PSM: Prompt Sensitivity Minimization via LLM-Guided Black-Box Optimization
Cryptography and Security
Protects secret AI instructions from being stolen.
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization
Machine Learning (CS)
Makes AI that understands pictures and words safer.