
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation

Published: May 16, 2025 | arXiv ID: 2505.11383v1

By: Zihan Wang, Seungjun Lee, Gim Hee Lee

Potential Business Impact:

Helps robots explore and remember places better.

Business Areas:

3D Technology Hardware, Software

Vision-and-Language Navigation (VLN) is a core task in which embodied agents use their spatial mobility to navigate 3D environments toward designated destinations, guided by natural language instructions. Recently, video-language large models (Video-VLMs) with strong generalization capabilities and rich commonsense knowledge have shown remarkable performance on VLN tasks. However, these models still face three challenges in real-world 3D navigation: 1) insufficient understanding of 3D geometry and spatial semantics; 2) limited capacity for large-scale exploration and long-term environmental memory; 3) poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that uses language-aligned, generalizable, and hierarchical 3D representations as visual input to train a 3D-VLM for navigation action prediction. Given posed RGB-D images, Dynam3D projects 2D CLIP features into 3D space and constructs multi-level 3D patch-instance-zone representations for 3D geometric and semantic understanding, maintained with a dynamic, layer-wise update strategy. Dynam3D encodes and localizes 3D instances online and dynamically updates them as the environment changes, providing large-scale exploration and long-term memory capabilities for navigation. By leveraging large-scale 3D-language pretraining and task-specific adaptation, Dynam3D sets new state-of-the-art performance on VLN benchmarks including R2R-CE, REVERIE-CE and NavRAG-CE under monocular settings. Furthermore, experiments on pre-exploration, lifelong memory, and a real-world robot validate its effectiveness in practical deployment.
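The step of projecting 2D features into 3D space from posed RGB-D images can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a pinhole camera model, a 4x4 camera-to-world pose matrix, and a feature map already aligned pixel-for-pixel with the depth map. The function names (`backproject_pixel`, `camera_to_world`, `lift_features`) and the plain-list data layout are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into camera coordinates (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3, pose: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 camera-to-world pose (row-major) to a camera-frame point."""
    x, y, z = p_cam
    return tuple(
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )


def lift_features(feature_map: List[List[List[float]]],
                  depth_map: List[List[float]],
                  intrinsics: Tuple[float, float, float, float],
                  pose: Sequence[Sequence[float]]) -> List[Tuple[Vec3, List[float]]]:
    """Pair each pixel's 2D feature vector with its 3D world position.

    Assumption: feature_map[v][u] is the feature for the pixel at depth_map[v][u].
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    fx, fy, cx, cy = intrinsics
    tokens = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d <= 0:
                continue
            p_cam = backproject_pixel(u, v, d, fx, fy, cx, cy)
            tokens.append((camera_to_world(p_cam, pose), feature_map[v][u]))
    return tokens
```

Each returned pair is a 3D position with its feature vector. A layered representation such as the patch-instance-zone hierarchy described above would then group and update these points, which this sketch does not attempt.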

D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation

CV and Pattern Recognition

Helps robots understand and navigate 3D worlds.

14 Dec 2025 2

92%

DyNaVLM: Zero-Shot Vision-Language Navigation System with Dynamic Viewpoints and Self-Refining Graph Memory

Robotics

Robots learn to explore new places by seeing and hearing.

18 Jun 2025 0

91%

A Navigation Framework Utilizing Vision-Language Models

Robotics

Helps robots follow spoken directions in new places.

11 Jun 2025 0


Country of Origin

🇸🇬 Singapore

Repos / Data Links

github.com

Page Count

14 pages
