Analyzing the Sensitivity of Vision Language Models in Visual Question Answering
By: Monika Shah, Sudarshan Balaji, Somdeb Sarkhel, and more
Potential Business Impact:
Helps AI understand tricky questions the way people do.
We can think of Visual Question Answering (VQA) as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of the cooperative principles of conversation proposed by Grice. Even when Grice's maxims of conversation are flouted, humans typically have little difficulty understanding the conversation, though it requires more cognitive effort. We study whether VLMs can handle violations of Grice's maxims in a manner similar to humans. Specifically, we add modifiers to human-crafted questions and analyze the responses of VLMs to these modifiers. We use three state-of-the-art VLMs in our study, namely GPT-4o, Claude-3.5-Sonnet, and Gemini-1.5-Flash, on questions from the VQA v2.0 dataset. Our initial results seem to indicate that the performance of VLMs consistently diminishes with the addition of modifiers, which suggests that our approach is a promising direction for understanding the limitations of VLMs.
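To make the probing setup concrete, here is a minimal sketch of how one might prepend maxim-flouting modifiers to a VQA question and compare a VLM's answers, using the OpenAI Python SDK for the GPT-4o case. This is not the authors' code: the modifier phrases in MODIFIERS, the ask_gpt4o helper, the example.jpg path, and the sample question are all illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): probe a VLM's
# sensitivity by prepending modifiers that flout Grice's maxims to a
# VQA-style question, then compare answers against the unmodified
# baseline. Requires `pip install openai` and OPENAI_API_KEY set.

import base64
from openai import OpenAI

# Hypothetical modifiers, one per flouted Gricean maxim; the paper's
# actual modifiers are not reproduced here.
MODIFIERS = {
    "quantity": "Answer in exhaustive, excruciating detail: ",
    "quality":  "Even if you have to guess wildly, ",
    "relation": "Ignoring everything else in the scene, ",
    "manner":   "In a roundabout and indirect way, ",
}

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt4o(image_path: str, question: str) -> str:
    """Send one image+question pair to GPT-4o and return its answer."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Compare the baseline answer with answers to each modified question.
question = "What color is the umbrella?"  # a VQA v2.0-style question
baseline = ask_gpt4o("example.jpg", question)
for maxim, prefix in MODIFIERS.items():
    modified = ask_gpt4o("example.jpg", prefix + question)
    print(f"{maxim}: baseline={baseline!r} vs modified={modified!r}")
```

The same loop could be pointed at Claude-3.5-Sonnet or Gemini-1.5-Flash through their respective SDKs; only the ask_gpt4o helper would change.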
Similar Papers
VLMs Guided Interpretable Decision Making for Autonomous Driving
CV and Pattern Recognition
Helps self-driving cars make safer, clearer choices.
Are Large Vision Language Models Truly Grounded in Medical Images? Evidence from Italian Clinical Visual Question Answering
CV and Pattern Recognition
Computers sometimes guess answers without looking.
Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models
Artificial Intelligence
Makes AI understand pictures and facts better.