The Spatial Blindspot of Vision-Language Models
By: Nahid Alam, Leema Krishna Murali, Siddhant Bharadwaj, and more
Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built on contrastive language-image pretraining (CLIP)-style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications that require spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
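To make the second design axis concrete, below is a minimal sketch of one common form of 2D positional encoding: a factorized sinusoidal embedding in which half of the channels encode a patch's row index and the other half its column index, so the encoder sees the (row, col) location of each patch rather than only its rank in a flattened 1D sequence. This is an illustration, not the authors' implementation; the function names, the 14x14 grid, and the 768-dim embedding size are assumptions chosen to match a typical ViT-B/16 on 224px input.

```python
# Illustrative sketch (not the authors' implementation) of factorized 2D
# sinusoidal positional encodings for a ViT-style patch grid.
import numpy as np

def sincos_1d(positions: np.ndarray, dim: int) -> np.ndarray:
    """Standard 1D sine/cosine encoding for a vector of integer positions."""
    assert dim % 2 == 0
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(positions, freqs)                               # (n, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)   # (n, dim)

def sincos_2d(grid_h: int, grid_w: int, dim: int) -> np.ndarray:
    """Factorized 2D encoding: the first dim/2 channels encode the row index,
    the remaining dim/2 channels encode the column index."""
    assert dim % 4 == 0
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    row_enc = sincos_1d(rows.reshape(-1), dim // 2)    # (H*W, dim/2)
    col_enc = sincos_1d(cols.reshape(-1), dim // 2)    # (H*W, dim/2)
    return np.concatenate([row_enc, col_enc], axis=1)  # (H*W, dim)

# Example: a 14x14 patch grid with 768-dim embeddings (assumed ViT-B/16 setup).
pos = sincos_2d(14, 14, 768)           # shape (196, 768)
# patch_tokens = patch_tokens + pos    # added to patch embeddings before the transformer
```

In contrast, a learned 1D positional table indexed by the flattened patch order gives patches at the end of one row and the start of the next adjacent indices even though they are far apart in the image, which is the kind of lost 2D structure the abstract refers to.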