Map2Thought: Explicit 3D Spatial Reasoning via Metric Cognitive Maps
By: Xiangjun Gao, Zhensong Zhang, Dave Zhenyu Chen, and more
Potential Business Impact:
Helps robots understand and navigate 3D spaces.
We propose Map2Thought, a framework that enables explicit and interpretable spatial reasoning for 3D VLMs. The framework is grounded in two key components: the Metric Cognitive Map (Metric-CogMap) and the Cognitive Chain-of-Thought (Cog-CoT). Metric-CogMap provides a unified spatial representation by integrating a discrete grid for relational reasoning with a continuous, metric-scale representation for precise geometric understanding. Building on the Metric-CogMap, Cog-CoT performs explicit geometric reasoning through deterministic operations, including vector operations, bounding-box distances, and occlusion-aware appearance-order cues, producing interpretable inference traces grounded in 3D structure. Experimental results show that Map2Thought enables explainable 3D understanding, achieving 59.9% accuracy with only half the supervision, closely matching the 60.9% baseline trained on the full dataset. It consistently outperforms state-of-the-art methods by 5.3%, 4.8%, and 4.0% under 10%, 25%, and 50% training subsets, respectively, on VSI-Bench.
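The deterministic operations the abstract mentions can be illustrated with a small sketch. The function names and box representation below are assumptions for illustration, not the paper's actual implementation: given metric-scale axis-aligned 3D bounding boxes from a cognitive map, simple closed-form geometry yields inter-object direction vectors and minimum box-to-box distances, the kind of quantities a reasoning trace could cite.

```python
import numpy as np

def centroid(bbox):
    """Centroid of an axis-aligned 3D box given as (min_xyz, max_xyz)."""
    lo, hi = np.asarray(bbox[0], float), np.asarray(bbox[1], float)
    return (lo + hi) / 2.0

def direction_vector(bbox_a, bbox_b):
    """Unit vector pointing from object A toward object B."""
    v = centroid(bbox_b) - centroid(bbox_a)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def bbox_distance(bbox_a, bbox_b):
    """Minimum metric distance between two axis-aligned 3D boxes (0 if they overlap)."""
    lo_a, hi_a = np.asarray(bbox_a[0], float), np.asarray(bbox_a[1], float)
    lo_b, hi_b = np.asarray(bbox_b[0], float), np.asarray(bbox_b[1], float)
    # Per-axis gap between the boxes, clamped at zero where they overlap.
    gap = np.maximum(0.0, np.maximum(lo_a - hi_b, lo_b - hi_a))
    return float(np.linalg.norm(gap))

# Hypothetical example: two unit cubes with a 1 m gap along the x-axis.
chair = ((0, 0, 0), (1, 1, 1))
table = ((2, 0, 0), (3, 1, 1))
print(bbox_distance(chair, table))     # 1.0
print(direction_vector(chair, table))  # [1. 0. 0.]
```

Because every step is a deterministic geometric computation rather than a learned prediction, each intermediate value (a distance in meters, a direction vector) can appear verbatim in an inference trace, which is what makes this style of reasoning interpretable.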
Similar Papers
Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning
CV and Pattern Recognition
Helps computers understand 3D space from videos.
CogniMap3D: Cognitive 3D Mapping and Rapid Retrieval
CV and Pattern Recognition
Helps robots remember and understand places they visit.
Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views
CV and Pattern Recognition
Lets computers imagine 3D shapes from pictures.