Referring Expressions as a Lens into Spatial Language Grounding in Vision-Language Models
By: Akshar Tumu, Varad Shinde, Parisa Kordjamshidi
Potential Business Impact:
Helps computers understand where things are.
Spatial reasoning is an important component of human cognition and an area in which the latest vision-language models (VLMs) show signs of difficulty. Current analyses rely on image captioning and visual question answering tasks. In this work, we propose using the Referring Expression Comprehension task instead as a platform for evaluating the spatial reasoning of VLMs. This platform provides the opportunity for a deeper analysis of spatial comprehension and grounding abilities when there is 1) ambiguity in object detection, 2) complex spatial expressions with longer sentence structure and multiple spatial relations, and 3) expressions with negation ('not'). In our analysis, we use task-specific architectures as well as large VLMs and highlight their strengths and weaknesses in dealing with these specific situations. While all of these models face challenges with the task at hand, their relative behaviors depend on the underlying model and the specific category of spatial semantics (topological, directional, proximal, etc.). Our results highlight these challenges and behaviors and provide insight into research gaps and future directions.
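For readers unfamiliar with how Referring Expression Comprehension is typically scored, the sketch below shows one common protocol: a model returns a single bounding box for each (image, expression) pair and is credited when that box overlaps the annotated box at IoU ≥ 0.5. This is a minimal illustration of the task format, not the paper's actual evaluation pipeline; the `predict_box` callback, the example tuples, and the 0.5 threshold are assumptions made for the sketch.

```python
def iou(box_a, box_b):
    """Intersection-over-union for two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(examples, predict_box, iou_threshold=0.5):
    """Accuracy@IoU>=threshold over (image, expression, gold_box) examples.

    `predict_box(image, expression)` is a hypothetical stand-in for whatever
    model is under evaluation (a task-specific grounder or a large VLM).
    """
    hits = 0
    for image, expression, gold_box in examples:
        pred_box = predict_box(image, expression)
        if iou(pred_box, gold_box) >= iou_threshold:
            hits += 1
    return hits / len(examples) if examples else 0.0
```

Under a setup like this, the categories the paper analyzes (ambiguous objects, multi-relation expressions, negation) would correspond to subsets of `examples`, with accuracy reported per subset.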
Similar Papers
Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding
Robotics
Helps robots find specific things in pictures.
Vision-Language Memory for Spatial Reasoning
CV and Pattern Recognition
Robots understand 3D space better from videos.
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation
Computation and Language
Helps computers describe pictures like people do.