From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs
By: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, and more
While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence, which is crucial for robust and grounded AI systems, remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. The dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum, from qualitative relational reasoning to quantitative metric and kinematic understanding. Our evaluations reveal that the performance gains observed on structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs rely heavily on linguistic priors rather than grounded visual reasoning. The benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.
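As a minimal sketch of what the two techniques named in the abstract might look like in practice, the snippet below turns metrically localized objects into a distance question with a verifiable answer, then runs a "blinding" comparison in which the same question is asked with and without the image. All names here (ObjectTrack, metric_distance_question, ask_mllm, blinding_gap) are hypothetical illustrations under assumed ground-truth positions, not the paper's actual pipeline.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ObjectTrack:
    """A metrically localized object, e.g. fused from stereo depth and LiDAR."""
    label: str
    position: Tuple[float, float, float]  # (x, y, z) in meters


def metric_distance_question(a: ObjectTrack, b: ObjectTrack) -> Tuple[str, float]:
    """Pair two localized objects into a question with metric ground truth."""
    dist = math.dist(a.position, b.position)  # Euclidean distance in meters
    question = f"How many meters separate the {a.label} and the {b.label}?"
    return question, round(dist, 1)


def blinding_gap(ask_mllm: Callable[[Optional[bytes], str], float],
                 image: bytes, question: str, answer: float,
                 tol: float = 0.5) -> Tuple[bool, bool]:
    """Ask the same question with and without pixels.

    A model that scores equally well when blinded is answering from
    linguistic priors rather than grounded visual reasoning.
    ask_mllm is a hypothetical model interface supplied by the caller.
    """
    with_image = abs(ask_mllm(image, question) - answer) <= tol
    blinded = abs(ask_mllm(None, question) - answer) <= tol
    return with_image, blinded


# Example: a verifiable QA pair from assumed ground-truth positions.
car = ObjectTrack("parked car", (2.0, 0.0, 5.0))
sign = ObjectTrack("stop sign", (6.0, 0.0, 8.0))
q, a = metric_distance_question(car, sign)
print(q, "->", a)  # -> 5.0
```

Because the answer is computed from sensor-derived 3D positions rather than human annotation, question generation of this kind scales automatically and stays metrically verifiable, which is what distinguishes it from purely qualitative benchmarks.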
Similar Papers
Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes
Machine Learning (CS)
Teaches AI to understand where things are.
SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition
Artificial Intelligence
Tests how well computers understand space and plan.
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models
Computer Vision and Pattern Recognition
Teaches computers to understand 3D objects from different views.