Score: 2

Video4Spatial: Towards Visuospatial Intelligence with Context-Guided Video Generation

Published: December 2, 2025 | arXiv ID: 2512.03040v1

By: Zeqi Xiao, Yiwei Zhao, Lingxiao Li, and more

BigTech Affiliations: Netflix

Potential Business Impact:

Teaches computers to understand space from videos.

Business Areas:
Image Recognition, Data and Analytics, Software

We investigate whether video generative models can exhibit visuospatial intelligence, a capability central to human cognition, using only visual data. To this end, we present Video4Spatial, a framework showing that video diffusion models conditioned solely on video-based scene context can perform complex spatial tasks. We validate on two tasks: scene navigation, which requires following camera-pose instructions while remaining consistent with the 3D geometry of the scene, and object grounding, which requires semantic localization, instruction following, and planning. Both tasks use video-only inputs, without auxiliary modalities such as depth or poses. With simple yet effective design choices in the framework and data curation, Video4Spatial demonstrates strong spatial understanding from video context: it plans navigation and grounds target objects end-to-end, follows camera-pose instructions while maintaining spatial consistency, and generalizes to long contexts and out-of-domain environments. Taken together, these results advance video generative models toward general visuospatial reasoning.
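
To make the conditioning setup in the abstract concrete, here is a minimal, hypothetical sketch (not the authors' code) of how a video diffusion sampler might take video-only scene context plus a camera-pose instruction as its sole conditioning signals. All function names, the instruction format, and the stub denoiser are assumptions for illustration; a real implementation would use a learned video diffusion network.

```python
# Hypothetical sketch: video-context-conditioned sampling loop.
# The denoiser below is a stand-in for a learned video diffusion model
# that would attend to the context frames and the pose instruction.
import numpy as np


def denoise_step(noisy_frames, t, context_frames, pose_instruction):
    """Stub denoiser: nudges the sample toward the context mean.
    A trained model would predict noise conditioned on context + instruction."""
    target = context_frames.mean(axis=0, keepdims=True)
    return noisy_frames + 0.1 * (target - noisy_frames)


def sample_navigation_clip(context_frames, pose_instruction, num_frames=8,
                           num_steps=50, frame_shape=(64, 64, 3), seed=0):
    """Iteratively denoise a clip conditioned only on prior video frames
    (scene context) and a camera-pose instruction -- no depth or pose maps."""
    rng = np.random.default_rng(seed)
    frames = rng.standard_normal((num_frames, *frame_shape))
    for t in reversed(range(num_steps)):
        frames = denoise_step(frames, t, context_frames, pose_instruction)
    return frames


if __name__ == "__main__":
    # Video-only context: previously observed frames of the scene.
    context = np.random.default_rng(1).random((16, 64, 64, 3))
    # Camera-pose instruction encoded as relative motion (hypothetical format).
    instruction = {"rotation_deg": 30.0, "translation": [0.5, 0.0, 0.0]}
    clip = sample_navigation_clip(context, instruction)
    print("generated clip shape:", clip.shape)
```

The point of the sketch is the interface, not the model: both navigation and grounding in the paper are described as conditioning generation on raw video context alone, so any auxiliary geometry (depth, poses) is deliberately absent from the inputs.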

Country of Origin
πŸ‡ΊπŸ‡Έ πŸ‡¬πŸ‡§ πŸ‡ΈπŸ‡¬ United States, United Kingdom, Singapore

Page Count
16 pages

Category
Computer Science:
Computer Vision and Pattern Recognition