Evaluating Robustness of Vision-Language Models Under Noisy Conditions
By: Purushoth, Alireza
Potential Business Impact:
Tests how well AI sees and understands pictures when they are noisy or blurry.
Vision-Language Models (VLMs) have achieved remarkable success across multimodal tasks such as image captioning and visual question answering. However, their robustness under noisy conditions remains underexplored. In this study, we present a comprehensive framework to evaluate the performance of several state-of-the-art VLMs under controlled perturbations, including lighting variation, motion blur, and compression artifacts. We use both lexical metrics (BLEU, METEOR, ROUGE, CIDEr) and neural similarity measures based on sentence embeddings to quantify semantic alignment. Our experiments span diverse datasets and reveal key insights: (1) the descriptiveness of ground-truth captions significantly influences model performance; (2) larger models like LLaVA excel in semantic understanding but do not universally outperform smaller models; and (3) certain noise types, such as JPEG compression and motion blur, dramatically degrade performance across models. Our findings highlight the nuanced trade-offs between model size, dataset characteristics, and noise resilience, and offer a standardized benchmark for future robust multimodal learning.
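To make the evaluation loop concrete, below is a minimal sketch of how controlled perturbations and embedding-based similarity scoring might be wired together. It assumes Pillow for image corruption and sentence-transformers for the neural similarity score; the `caption_model` call and the specific perturbation parameters are illustrative placeholders, not the paper's actual implementation.

```python
# Sketch: apply a controlled perturbation, caption the noisy image with a VLM,
# and score the caption against the ground truth with sentence embeddings.
import io
from PIL import Image, ImageEnhance, ImageFilter
from sentence_transformers import SentenceTransformer, util

def perturb(image: Image.Image, kind: str) -> Image.Image:
    """Apply one of the controlled perturbations (parameters are illustrative)."""
    if kind == "lighting":            # darken to simulate poor lighting
        return ImageEnhance.Brightness(image).enhance(0.4)
    if kind == "motion_blur":         # approximated here with a Gaussian blur
        return image.filter(ImageFilter.GaussianBlur(radius=3))
    if kind == "jpeg":                # heavy compression to induce artifacts
        buf = io.BytesIO()
        image.save(buf, format="JPEG", quality=10)
        buf.seek(0)
        return Image.open(buf).convert("RGB")
    return image

# Neural similarity between a generated caption and a reference caption.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_similarity(generated: str, reference: str) -> float:
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

# Example loop (caption_model stands in for any VLM under test):
# image = Image.open("example.jpg").convert("RGB")
# for kind in ["lighting", "motion_blur", "jpeg"]:
#     caption = caption_model(perturb(image, kind))   # hypothetical VLM call
#     print(kind, round(semantic_similarity(caption, "a dog running on the beach"), 3))
```

Lexical metrics such as BLEU, METEOR, ROUGE, and CIDEr would be computed on the same (generated, reference) caption pairs alongside the embedding-based score.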
Similar Papers
Analysing the Robustness of Vision-Language-Models to Common Corruptions
CV and Pattern Recognition
Makes AI understand pictures even when they're messy.
Coordinated Robustness Evaluation Framework for Vision-Language Models
CV and Pattern Recognition
Tests whether AI models get fooled by tricky pictures and words.
Are vision language models robust to uncertain inputs?
CV and Pattern Recognition
Makes AI admit when it doesn't know.