Same or Not? Enhancing Visual Perception in Vision-Language Models
By: Damiano Marsili , Aditya Mehta , Ryan Y. Lin and more
Potential Business Impact:
Teaches computers to spot tiny differences in pictures.
Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/
Similar Papers
Infusing fine-grained visual knowledge to Vision-Language Models
CV and Pattern Recognition
Keeps AI smart while teaching new skills.
Image Recognition with Vision and Language Embeddings of VLMs
CV and Pattern Recognition
Helps computers understand pictures better with words or just sight.
From Synthetic Scenes to Real Performance: Enhancing Spatial Reasoning in VLMs
CV and Pattern Recognition
Makes AI understand pictures better, without mistakes.