Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models
By: Shizhan Gong, Yankai Jiang, Qi Dou, and more
Potential Business Impact:
Helps AI systems perceive fine visual details, so they give more accurate answers about images.
Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder retains compatibility with the frozen text encoder and exhibits significant improvements in zero-shot object recognition, fine-grained spatial reasoning, and localization. By integrating the aligned visual encoder, downstream MLLMs also demonstrate enhanced performance.
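The abstract does not spell out the alignment objective, but a batch-wise kernel alignment loss suitable for stochastic optimization could look like the sketch below. It assumes a linear, CKA-style kernel over minibatch embeddings plus an anchor term that keeps the fine-tuned features close to the original CLIP space so the frozen text encoder stays compatible. The function names (`kernel_alignment_loss`, `total_loss`), the choice of kernel, and the weight `lam` are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch of a kernel-style alignment loss between CLIP and DINOv2
# visual embeddings over a minibatch. Assumption: a centered linear (CKA-style)
# kernel; the paper's exact kernel, centering, and regularization may differ.

import torch
import torch.nn.functional as F


def linear_gram(x: torch.Tensor) -> torch.Tensor:
    """Centered linear kernel (Gram) matrix for a batch of embeddings."""
    x = x - x.mean(dim=0, keepdim=True)   # center features over the batch
    return x @ x.t()                      # (B, B) Gram matrix


def kernel_alignment_loss(clip_feats: torch.Tensor,
                          dino_feats: torch.Tensor) -> torch.Tensor:
    """CKA-style term: push the CLIP Gram matrix toward the DINOv2 one."""
    k_clip = linear_gram(clip_feats)
    k_dino = linear_gram(dino_feats)
    # Normalized Frobenius inner product (higher = more aligned), turned into a loss.
    hsic = (k_clip * k_dino).sum()
    norm = k_clip.norm() * k_dino.norm() + 1e-8
    return 1.0 - hsic / norm


def total_loss(clip_feats, dino_feats, clip_feats_frozen, lam=1.0):
    """Alignment term plus an anchor that keeps embeddings near the original
    CLIP space, so the frozen text encoder remains compatible (hypothetical)."""
    align = kernel_alignment_loss(clip_feats, dino_feats)
    anchor = 1.0 - F.cosine_similarity(clip_feats, clip_feats_frozen, dim=-1).mean()
    return align + lam * anchor


if __name__ == "__main__":
    # In a real loop: clip_feats comes from the trainable CLIP visual encoder,
    # dino_feats from a frozen DINOv2, clip_feats_frozen from the original CLIP.
    B, d_clip, d_dino = 32, 768, 1024                     # hypothetical dims
    clip_feats = torch.randn(B, d_clip, requires_grad=True)
    dino_feats = torch.randn(B, d_dino)
    clip_frozen = clip_feats.detach() + 0.01 * torch.randn(B, d_clip)
    loss = total_loss(clip_feats, dino_feats, clip_frozen)
    loss.backward()
    print(float(loss))
```

Because the kernel is computed per minibatch, gradients are cheap to evaluate, which is consistent with the abstract's claim that the objective is designed for efficient stochastic optimization.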
Similar Papers
Data or Language Supervision: What Makes CLIP Better than DINO?
CV and Pattern Recognition
Makes AI understand pictures and words better.
CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification
CV and Pattern Recognition
Finds people in photos even in new places.
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
CV and Pattern Recognition
Helps robots grab things by seeing better.