Vision-Language Memory for Spatial Reasoning

Published: November 25, 2025 | arXiv ID: 2511.20644v1

By: Zuntao Liu, Yi Du, Taimeng Fu and more

Potential Business Impact:

Robots understand 3D space better from videos.

Business Areas:

Image Recognition, Data and Analytics, Software

Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and long-horizon spatial reasoning with a fixed computational cost. Extensive experiments on multiple benchmarks show that VLM$^2$ achieves state-of-the-art performance among video-only models, significantly advancing the frontier of visual-spatial intelligence.
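
The dual-memory idea described above can be illustrated with a minimal sketch. The abstract states only that the working memory is a sliding window and that the episodic memory consolidates long-term information at a fixed cost. Everything else here is an assumption made for illustration: the class name `DualMemory`, the rule that frames leaving the window are consolidated, and the merge-by-averaging into the most similar episodic slot are hypothetical and are not taken from the paper.

```python
from collections import deque
from math import sqrt
from typing import Deque, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


class DualMemory:
    """Hypothetical sketch: sliding-window working memory plus a
    fixed-capacity episodic memory. Not the authors' implementation."""

    def __init__(self, window_size: int, episodic_capacity: int) -> None:
        if window_size < 1 or episodic_capacity < 1:
            raise ValueError("both sizes must be at least 1")
        # Working memory: keeps only the most recent `window_size` features.
        self.working: Deque[Vector] = deque(maxlen=window_size)
        # Episodic memory: never grows beyond `episodic_capacity` slots.
        self.episodic: List[Vector] = []
        self.episodic_capacity = episodic_capacity

    def observe(self, feature: Vector) -> None:
        """Add one per-frame feature, consolidating the frame it displaces."""
        if len(self.working) == self.working.maxlen:
            # Assumed rule: the oldest frame is consolidated before eviction.
            self._consolidate(self.working[0])
        self.working.append(list(feature))

    def _consolidate(self, feature: Vector) -> None:
        if len(self.episodic) < self.episodic_capacity:
            self.episodic.append(list(feature))
            return
        # Assumed rule: when full, average into the most similar slot
        # so the memory size (and the cost of reading it) stays fixed.
        idx = max(
            range(len(self.episodic)),
            key=lambda i: cosine(self.episodic[i], feature),
        )
        self.episodic[idx] = [
            (x + y) / 2.0 for x, y in zip(self.episodic[idx], feature)
        ]

    def context(self) -> List[Vector]:
        """Long-term slots followed by the recent window."""
        return self.episodic + list(self.working)


if __name__ == "__main__":
    memory = DualMemory(window_size=3, episodic_capacity=2)
    for t in range(10):
        memory.observe([float(t), 1.0])
    # The context never exceeds window_size + episodic_capacity entries.
    print(len(memory.context()))
```

The point of the sketch is the bound: however long the video, the context passed to the model holds at most `window_size + episodic_capacity` entries, which is one way to read the "fixed computational cost" claimed in the abstract.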

Page Count

20 pages
