
SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation

Published: January 11, 2026 | arXiv ID: 2601.06806v1

By: Jiwen Zhang, Zejun Li, Siyuan Wang and more

Although learning-based vision-and-language navigation (VLN) agents can learn spatial knowledge implicitly from large-scale training data, zero-shot VLN agents lack this process and rely primarily on local observations for navigation, which leads to inefficient exploration and a significant performance gap. To address this problem, we consider a zero-shot VLN setting in which agents are allowed to fully explore the environment before task execution. We then construct the Spatial Scene Graph (SSG) to explicitly capture the global spatial structure and semantics of the explored environment. Based on the SSG, we introduce SpatialNav, a zero-shot VLN agent that integrates an agent-centric spatial map, a compass-aligned visual representation, and a remote object localization strategy for efficient navigation. Comprehensive experiments in both discrete and continuous environments demonstrate that SpatialNav significantly outperforms existing zero-shot agents and clearly narrows the gap with state-of-the-art learning-based methods. These results highlight the importance of global spatial representations for generalizable navigation.
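
To make the idea of an agent-centric view over a global scene graph more concrete, here is a minimal Python sketch. It is not the SpatialNav implementation: the `Node`, `SceneGraph`, and `agent_centric` names, the 2D coordinates, and the compass convention (0 degrees = north along +y, 90 degrees = east along +x) are all assumptions made for illustration. It only shows how globally stored room and object positions could be re-expressed as distance and relative bearing from the agent's current pose.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical scene-graph node: a room or an object with a global 2D position."""
    name: str
    kind: str   # "room" or "object" (illustrative labels)
    x: float    # global east coordinate, metres (assumed convention)
    y: float    # global north coordinate, metres (assumed convention)


@dataclass
class SceneGraph:
    """Hypothetical graph: nodes by name plus (parent, child) containment edges."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add(self, node, parent=None):
        self.nodes[node.name] = node
        if parent is not None:
            self.edges.add((parent, node.name))

    def children(self, parent):
        # Names of nodes directly contained in `parent`, sorted for stable output.
        return sorted(c for p, c in self.edges if p == parent)


def agent_centric(graph, ax, ay, heading_deg):
    """Map each node name to (distance, relative bearing in degrees).

    The relative bearing is in [-180, 180): 0 means straight ahead,
    negative means to the agent's left, positive to its right.
    """
    view = {}
    for node in graph.nodes.values():
        dx, dy = node.x - ax, node.y - ay
        distance = math.hypot(dx, dy)
        # Compass bearing of the node: atan2(east, north), so north is 0 degrees.
        bearing = math.degrees(math.atan2(dx, dy))
        # Subtract the agent's heading and wrap into [-180, 180).
        relative = (bearing - heading_deg + 180.0) % 360.0 - 180.0
        view[node.name] = (round(distance, 2), round(relative, 1))
    return view
```

As a usage illustration, an agent at the origin facing east (heading 90 degrees) would see a node 3 m to the east as straight ahead and a node 5 m to the north as 90 degrees to its left. Keeping positions in a global frame and deriving the egocentric view on demand is one simple design choice; the paper may build its map differently.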

MSNav: Zero-Shot Vision-and-Language Navigation with Dynamic Memory and LLM Spatial Reasoning

CV and Pattern Recognition

Helps robots follow directions and remember places.

20 Aug 2025

92%

Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents

Artificial Intelligence

Helps robots follow directions in new places.

11 Aug 2025

91%

DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation

Robotics

Robot learns to follow directions by imagining paths.

14 Sep 2025
