Can Argus Judge Them All? Comparing VLMs Across Domains
By: Harsh Joshi, Gautam Siddharth Kashyap, Rafiq Ali, and more
Potential Business Impact:
Shows which AI models handle pictures and words best for different jobs.
Vision-Language Models (VLMs) are advancing multimodal AI, yet their performance consistency across tasks is underexamined. We benchmark CLIP, BLIP, and LXMERT across diverse datasets spanning retrieval, captioning, and reasoning. Our evaluation includes task accuracy, generation quality, efficiency, and a novel Cross-Dataset Consistency (CDC) metric. CLIP shows strongest generalization (CDC: 0.92), BLIP excels on curated data, and LXMERT leads in structured reasoning. These results expose trade-offs between generalization and specialization, informing industrial deployment of VLMs and guiding development toward robust, task-flexible architectures.
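The abstract does not spell out how the Cross-Dataset Consistency (CDC) metric is computed. As a minimal sketch only, one plausible formulation is 1 minus the coefficient of variation of a model's per-dataset scores, so that a model scoring similarly everywhere lands near 1. The function name cross_dataset_consistency and this formulation are assumptions for illustration, not the authors' definition.

```python
# Hypothetical sketch of a Cross-Dataset Consistency (CDC) style metric.
# The formulation (1 - coefficient of variation of per-dataset scores) is an
# assumption for illustration; the paper's own definition may differ.
from statistics import mean, pstdev


def cross_dataset_consistency(scores: dict[str, float]) -> float:
    """Return a consistency value in [0, 1] from per-dataset scores in [0, 1].

    Similar scores on every dataset give a CDC near 1; large spread across
    datasets pulls the value toward 0.
    """
    values = list(scores.values())
    if len(values) < 2:
        return 1.0  # a single dataset is trivially consistent
    mu = mean(values)
    if mu == 0:
        return 0.0
    cv = pstdev(values) / mu  # coefficient of variation across datasets
    return max(0.0, 1.0 - cv)


# Example with made-up per-task accuracies for a CLIP-like model.
clip_scores = {"retrieval": 0.88, "captioning": 0.81, "reasoning": 0.79}
print(f"CDC ~ {cross_dataset_consistency(clip_scores):.2f}")
```

Under this kind of formulation, a high CDC such as the 0.92 reported for CLIP would indicate that its scores vary little across the retrieval, captioning, and reasoning datasets, which is what the abstract describes as strong generalization.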
Similar Papers
How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study
CV and Pattern Recognition
Helps computers understand medical pictures better.
Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
CV and Pattern Recognition
Teaches computers to understand logic in pictures.
A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
CV and Pattern Recognition
Lets computers understand pictures and words together.