Image Recognition with Vision and Language Embeddings of VLMs
By: Illia Volkov, Nikita Kisel, Klara Janouskova, and more
Potential Business Impact:
Helps computers recognise what is in pictures using words, sight, or both.
Vision-language models (VLMs) have enabled strong zero-shot classification through image-text alignment. Yet, their purely visual inference capabilities remain under-explored. In this work, we conduct a comprehensive evaluation of both language-guided and vision-only image classification with a diverse set of dual-encoder VLMs, including both well-established and recent models such as SigLIP 2 and RADIOv2.5. Performance is compared in a standard setup on the ImageNet-1k validation set and its label-corrected variant. The key factors affecting accuracy are analysed, including prompt design, class diversity, the number of neighbours in k-NN, and the reference set size. We show that language and vision offer complementary strengths, with some classes favouring textual prompts and others better handled by visual similarity. To exploit this complementarity, we introduce a simple, learning-free fusion method based on per-class precision that improves classification performance. The code is available at: https://github.com/gonikisgo/bmvc2025-vlm-image-recognition.
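The abstract describes three ingredients: language-guided zero-shot classification via image-prompt similarity, vision-only k-NN over a labelled reference set, and a learning-free fusion driven by per-class precision. The sketch below illustrates these ideas under stated assumptions; the function names, the NumPy-only setup, and the exact fusion rule are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch, not the paper's code: language-guided vs. vision-only
# classification with a dual-encoder VLM, plus a precision-based fusion.
# Assumes pre-computed, L2-normalised embeddings (so dot product = cosine sim).
import numpy as np

def text_zero_shot(img_emb, class_text_emb):
    """Language-guided prediction: nearest class-prompt embedding."""
    # img_emb: (N, D), class_text_emb: (C, D)
    return (img_emb @ class_text_emb.T).argmax(axis=1)

def knn_vision_only(img_emb, ref_emb, ref_labels, k=5):
    """Vision-only prediction: majority vote over the k nearest reference images."""
    sims = img_emb @ ref_emb.T                 # (N, R) cosine similarities
    nn_idx = np.argsort(-sims, axis=1)[:, :k]  # indices of the k nearest references
    preds = []
    for row in nn_idx:
        votes = np.bincount(ref_labels[row])   # count class votes among neighbours
        preds.append(votes.argmax())
    return np.array(preds)

def per_class_precision(preds, labels, n_classes):
    """Precision of each predicted class, measured on a held-out split."""
    prec = np.zeros(n_classes)
    for c in range(n_classes):
        mask = preds == c
        prec[c] = (labels[mask] == c).mean() if mask.any() else 0.0
    return prec

def fuse(pred_text, pred_vis, prec_text, prec_vis):
    """Learning-free fusion (assumed rule): for each image, keep the prediction
    of whichever branch has the higher precision for the class it predicts."""
    take_text = prec_text[pred_text] >= prec_vis[pred_vis]
    return np.where(take_text, pred_text, pred_vis)
```

In this reading, the fusion needs no training: per-class precisions are estimated once on held-out data, and at test time each image simply takes the answer of the branch that is historically more reliable for the class it proposes.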
Similar Papers
Zero-shot image privacy classification with Vision-Language Models
CV and Pattern Recognition
Makes computers better at spotting which pictures are private.
Improving Visual Recommendation on E-commerce Platforms Using Vision-Language Models
Information Retrieval
Finds better products you'll like to buy.
A Survey on Efficient Vision-Language Models
CV and Pattern Recognition
Makes smart AI work on small, slow devices.