MSGNav: Unleashing the Power of Multi-modal 3D Scene Graph for Zero-Shot Embodied Navigation

Published: November 13, 2025 | arXiv ID: 2511.10376v1

By: Xun Huang, Shijia Zhao, Yunxiang Wang and more

Potential Business Impact:

Robots learn to explore new places without practice.

Business Areas:

Navigation, Navigation and Mapping

Embodied navigation is a fundamental capability for robotic agents. Real-world deployment requires open-vocabulary generalization and low training overhead, motivating zero-shot methods rather than task-specific RL training. However, existing zero-shot methods that build explicit 3D scene graphs often compress rich visual observations into text-only relations, leading to high construction cost, irreversible loss of visual evidence, and constrained vocabularies. To address these limitations, we introduce the Multi-modal 3D Scene Graph (M3DSG), which preserves visual cues by replacing textual relational edges with dynamically assigned images. Built on M3DSG, we propose MSGNav, a zero-shot navigation system that includes a Key Subgraph Selection module for efficient reasoning, an Adaptive Vocabulary Update module for open-vocabulary support, and a Closed-Loop Reasoning module for accurate exploration reasoning. We further identify the last-mile problem in zero-shot navigation, namely determining a feasible target location with a suitable final viewpoint, and propose a Visibility-based Viewpoint Decision module to resolve it explicitly. Comprehensive experimental results demonstrate that MSGNav achieves state-of-the-art performance on the GOAT-Bench and HM3D-OVON datasets. The code will be made publicly available.
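To make the idea of image-valued edges concrete, here is a minimal sketch of a scene graph whose edges store the IDs of images in which two objects are co-visible, instead of a textual relation. This is an illustration only, not the authors' implementation: the class `SceneGraph`, its methods, and the simple label-matching rule in `select_subgraph` are all hypothetical names and assumptions, and the selection rule is a stand-in for, not a description of, the paper's Key Subgraph Selection module.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy scene graph: each edge holds image IDs rather than a text relation."""
    objects: dict = field(default_factory=dict)  # object id -> label
    edges: dict = field(default_factory=dict)    # frozenset({a, b}) -> set of image ids

    def add_object(self, obj_id, label):
        self.objects[obj_id] = label

    def add_observation(self, image_id, visible_ids):
        """Attach image_id to the edge between every pair of co-visible objects."""
        ids = sorted(set(visible_ids))
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                self.edges.setdefault(frozenset((a, b)), set()).add(image_id)

    def select_subgraph(self, query_labels):
        """Assumed rule: keep objects whose label matches, plus direct neighbours.

        Returns the kept object ids and the images on the edges that touch a match.
        """
        seeds = {i for i, label in self.objects.items() if label in query_labels}
        kept, images = set(seeds), set()
        for pair, image_ids in self.edges.items():
            if pair & seeds:  # edge touches at least one matching object
                kept |= pair
                images |= image_ids
        return kept, images


graph = SceneGraph()
graph.add_object(1, "sofa")
graph.add_object(2, "lamp")
graph.add_object(3, "sink")
graph.add_observation("img_0", [1, 2])  # sofa and lamp seen together
graph.add_observation("img_1", [2, 3])  # lamp and sink seen together

kept_ids, kept_images = graph.select_subgraph({"sofa"})
```

In this toy setup, querying for "sofa" keeps the sofa, its neighbour the lamp, and the single image linking them, so a downstream reasoner would receive the visual evidence for that relation rather than a text label.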

CV and Pattern Recognition

MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

CV and Pattern Recognition

Helps robots explore new places without prior maps.

27 Nov 2025

View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

CV and Pattern Recognition

Helps robots find objects using words.

10 Dec 2025

Country of Origin

🇨🇳 China

Page Count

10 pages
