Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture
By: Wanyue Zhang, Yibin Huang, Yangbin Xu, and more
Potential Business Impact:
Helps robots understand where things are.
Spatial understanding is essential for Multimodal Large Language Models (MLLMs) to support perception, reasoning, and planning in embodied environments. Despite recent progress, studies show that MLLMs still struggle with spatial understanding, yet prior work lacks a comprehensive and systematic evaluation of these limitations and is often restricted to isolated scenarios such as single-view images or video. In this work, we present a systematic analysis of spatial understanding from both data and architectural perspectives across three representative scenarios: single-view, multi-view, and video. We propose MulSeT (Multi-view Spatial Understanding Tasks), a benchmark, and design a series of experiments to analyze the spatial reasoning capabilities of MLLMs. From the data perspective, spatial-understanding performance converges quickly as training data increases, and its upper bound is relatively low, especially for tasks that require spatial imagination; merely expanding training data is therefore insufficient to reach satisfactory performance. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than on that within the language model, in both cascaded and native MLLMs. Moreover, we explore reasoning injection and envision future improvements to spatial understanding through architectural design. These insights expose the limitations of current MLLMs and suggest new directions for improving spatial reasoning through data scaling and architectural tuning.
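To make the positional-encoding finding concrete, here is a minimal sketch, assuming a toy PyTorch setup (not the authors' MulSeT code or their actual models), of the kind of ablation the abstract describes: removing the learned positional embeddings from a ViT-style visual encoder makes it permutation-equivariant over image patches, so any downstream left/right or near/far judgment loses its spatial grounding. All class and variable names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyViTEncoder(nn.Module):
    """Toy patch encoder with learned positional embeddings (illustrative only)."""
    def __init__(self, num_patches=196, dim=64):
        super().__init__()
        self.patch_proj = nn.Linear(768, dim)  # stand-in for patchify + projection
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        self.block = nn.TransformerEncoderLayer(
            dim, nhead=4, dropout=0.0, batch_first=True
        )

    def forward(self, patches, use_pos=True):
        x = self.patch_proj(patches)
        if use_pos:  # ablation switch: keep or drop positional information
            x = x + self.pos_embed
        return self.block(x)

enc = TinyViTEncoder().eval()
patches = torch.randn(2, 196, 768)  # fake pre-projected patch features

with_pos = enc(patches, use_pos=True)
no_pos = enc(patches, use_pos=False)

# Without positional embeddings the encoder is permutation-equivariant:
# shuffling the patches just shuffles the outputs, so the encoder cannot
# distinguish any spatial arrangement of the scene.
perm = torch.randperm(196)
shuffled = enc(patches[:, perm], use_pos=False)
print(torch.allclose(no_pos[:, perm], shuffled, atol=1e-5))  # True (up to numerics)
```

The permutation check is the crux: an encoder that treats a scrambled image identically to the original cannot supply the language model with the spatial structure that tasks like relative-position or viewpoint reasoning require, which is consistent with the paper's observation that visual-encoder positional encoding matters more than the language model's.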
Similar Papers
Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes
Machine Learning (CS)
Teaches AI to understand where things are.
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
CV and Pattern Recognition
Teaches computers to understand 3D objects from different views.
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Artificial Intelligence
Tests how well computers understand space and plan.