G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
By: Wenbo Hu, Jingli Lin, Yilin Long, and more
Potential Business Impact:
Helps computers understand 3D space from pictures.
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, performing poorly on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry-grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and to enhance spatial reasoning via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data while simultaneously enjoying the benefits of 3D visual priors that are typically derived only from hard-to-collect annotations. Experimental results demonstrate that G$^2$VLM is proficient in both tasks, achieving results comparable to state-of-the-art feed-forward 3D reconstruction models and better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock future applications such as 3D scene editing.
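To make the unified design concrete, here is a minimal PyTorch sketch of the general idea, not the paper's actual implementation: all module names, dimensions, and the token-interleaving fusion below are assumptions for illustration. It shows a geometry branch whose tokens serve two purposes, regressing 3D attributes directly (here, per-token point maps) and being interleaved into the language backbone's context so reasoning can attend to geometry-grounded features.

```python
# Minimal sketch of a geometry-grounded VLM (all names and shapes are
# illustrative assumptions, not the paper's architecture).
import torch
import torch.nn as nn


class GeometryBranch(nn.Module):
    """Hypothetical visual-geometry encoder: maps multi-view image tokens
    to geometry tokens and decodes 3D attributes from them."""

    def __init__(self, dim=768):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Head that regresses per-token 3D attributes (here: xyz point maps).
        self.point_head = nn.Linear(dim, 3)

    def forward(self, view_tokens):
        geo_tokens = self.encoder(view_tokens)
        points = self.point_head(geo_tokens)  # direct 3D prediction
        return geo_tokens, points


class GeometryGroundedVLM(nn.Module):
    """Hypothetical unified model: geometry tokens are interleaved with
    visual and text tokens before the language backbone, so spatial
    reasoning can attend to 3D-grounded features in context."""

    def __init__(self, dim=768, vocab=32000):
        super().__init__()
        self.geometry = GeometryBranch(dim)
        # Stand-in for a causal LLM backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, visual_tokens, text_embeds):
        geo_tokens, points = self.geometry(visual_tokens)
        # Interleave geometry tokens with visual and text tokens so the
        # language model reasons over 3D-grounded context.
        fused = torch.cat([visual_tokens, geo_tokens, text_embeds], dim=1)
        hidden = self.llm(fused)
        return self.lm_head(hidden), points


# Usage: batch of 2, 16 visual tokens, 8 text tokens, hidden size 768.
model = GeometryGroundedVLM()
logits, points = model(torch.randn(2, 16, 768), torch.randn(2, 8, 768))
print(logits.shape, points.shape)  # (2, 40, 32000) and (2, 16, 3)
```

The point this sketch tries to capture is that one set of geometry tokens can feed two objectives at once, a 3D-reconstruction head and the language model's context, which is plausibly how a unified model can train on both multi-view data and language supervision.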
Similar Papers
Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation
CV and Pattern Recognition
Makes 3D pictures match words better.
Vision-Language Memory for Spatial Reasoning
CV and Pattern Recognition
Helps robots understand 3D space better from videos.