
Seeing through Imagination: Learning Scene Geometry via Implicit Spatial World Modeling

Published: December 1, 2025 | arXiv ID: 2512.01821v1

By: Meng Cao, Haokun Lin, Haoyuan Li and more

Potential Business Impact:

Teaches computers to see and understand 3D space.

Business Areas:

Visual Search Internet Services

Spatial reasoning, the ability to understand and interpret the 3D structure of the world, is a critical yet underdeveloped capability in Multimodal Large Language Models (MLLMs). Current methods predominantly rely on verbal descriptive tuning, which suffers from visual illiteracy: models learn spatial concepts through textual symbols alone, with no connection to their visual manifestations. To bridge this gap, this paper introduces MILO, an Implicit spatIaL wOrld modeling paradigm that simulates human-like spatial imagination. MILO integrates a visual generator to provide geometry-aware feedback, thereby implicitly grounding the MLLM's symbolic reasoning in perceptual experience. Complementing this paradigm, we propose RePE (Relative Positional Encoding), a novel encoding scheme that captures relative camera-pose transformations, offering superior performance over absolute coordinate systems. To support the training, we construct GeoGen, a large-scale Geometry-aware Generative dataset with approximately 2,241 videos and 67,827 observation-action-outcome triplets. Experiments demonstrate that our approach significantly enhances spatial reasoning capabilities across multiple baselines and benchmarks, offering a more holistic understanding of 3D space.
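The abstract contrasts relative camera-pose transformations with absolute coordinate systems. The sketch below is not the paper's RePE encoding, whose details are not given here. It only illustrates the general property that makes relative poses attractive: the transform between two cameras stays the same when the whole scene is moved to a different world frame. It assumes 4x4 camera-to-world matrices, and all function names (`make_pose`, `invert_rigid`, `relative_pose`) are hypothetical.

```python
import math

def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def make_pose(R, t):
    """Build a 4x4 camera-to-world matrix from rotation R and translation t."""
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def matmul(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(T):
    """Invert a rigid transform [R t; 0 1] as [R^T, -R^T t; 0 1]."""
    Rt = [[T[j][i] for j in range(3)] for i in range(3)]
    t = [T[i][3] for i in range(3)]
    t_inv = [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)]
    return make_pose(Rt, t_inv)

def relative_pose(T_a, T_b):
    """Pose of camera b expressed in camera a's frame: inv(T_a) @ T_b."""
    return matmul(invert_rigid(T_a), T_b)

# Two illustrative camera poses in some world frame (made-up values).
T_a = make_pose(rot_z(0.3), [1.0, 2.0, 0.0])
T_b = make_pose(rot_z(0.8), [2.0, 2.0, 1.0])

# The same two cameras after an arbitrary change of world frame G.
G = make_pose(rot_z(1.1), [5.0, -3.0, 2.0])
rel_original = relative_pose(T_a, T_b)
rel_shifted = relative_pose(matmul(G, T_a), matmul(G, T_b))

# Absolute poses change under G; the relative pose does not.
max_diff = max(abs(rel_original[i][j] - rel_shifted[i][j])
               for i in range(4) for j in range(4))
```

Here `max_diff` is zero up to floating-point error, because inv(G T_a) (G T_b) = inv(T_a) T_b. An absolute encoding would see different inputs for the two world frames even though the camera motion is identical.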

SpatialGeo: Boosting Spatial Reasoning in Multimodal LLMs via Geometry-Semantics Fusion

CV and Pattern Recognition

Helps computers understand 3D shapes and where things are.

21 Nov 2025 1

91%

SpatialDreamer: Incentivizing Spatial Reasoning via Active Mental Imagery

CV and Pattern Recognition

Lets computers imagine and solve puzzles like humans.

8 Dec 2025 1

90%

Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

CV and Pattern Recognition

Helps computers understand 3D space from videos.

20 Nov 2025 1


Repos / Data Links

github.com

Page Count

15 pages
