DreamNav: A Trajectory-Based Imaginative Framework for Zero-Shot Vision-and-Language Navigation
By: Yunheng Wang, Yuetong Fang, Taowen Wang, and more
Potential Business Impact:
Robot learns to follow directions by imagining paths.
Vision-and-Language Navigation in Continuous Environments (VLN-CE), which links language instructions to perception and control in the real world, is a core capability of embodied robots. Recently, large-scale pretrained foundation models have been leveraged as shared priors for perception, reasoning, and action, enabling zero-shot VLN without task-specific training. However, existing zero-shot VLN methods depend on costly perception and passive scene understanding, collapsing control to point-level choices. As a result, they are expensive to deploy, misaligned in action semantics, and short-sighted in planning. To address these issues, we present DreamNav, which focuses on three aspects: (1) to reduce sensory cost, our EgoView Corrector aligns viewpoints and stabilizes egocentric perception; (2) instead of point-level actions, our Trajectory Predictor performs global trajectory-level planning to better align with instruction semantics; and (3) to enable anticipatory, long-horizon planning, we propose an Imagination Predictor that endows the agent with proactive thinking capability. On VLN-CE and real-world tests, DreamNav sets a new zero-shot state of the art (SOTA), outperforming the strongest egocentric baseline that uses extra information by up to 7.49% and 18.15% on the SR and SPL metrics, respectively. To our knowledge, this is the first zero-shot VLN method to unify trajectory-level planning and active imagination while using only egocentric inputs.
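The abstract describes a three-stage pipeline: stabilize the egocentric view, imagine future observations, then plan a whole trajectory rather than a single point. The sketch below is a minimal illustration of that module composition only; every class, method, and field name here is an assumption for exposition, not the authors' actual API or model.

```python
# Hypothetical sketch of the three-module pipeline named in the abstract
# (EgoView Corrector -> Imagination Predictor -> Trajectory Predictor).
# All names and placeholder logic are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EgoViewCorrector:
    """Aligns the raw egocentric view to a stabilized viewpoint."""
    tilt_offset_deg: float = 0.0

    def correct(self, view: dict) -> dict:
        # Placeholder: compensate camera tilt before downstream reasoning.
        corrected = dict(view)
        corrected["tilt_deg"] = view.get("tilt_deg", 0.0) - self.tilt_offset_deg
        return corrected


class ImaginationPredictor:
    """Rolls the corrected view forward into imagined future observations."""

    def imagine(self, view: dict, horizon: int = 3) -> List[dict]:
        # Placeholder: the real system would use a generative world model here.
        return [{**view, "step": t} for t in range(1, horizon + 1)]


class TrajectoryPredictor:
    """Emits a trajectory (waypoint sequence), not a single point-level action."""

    def plan(self, instruction: str, imagined: List[dict]) -> List[Tuple[float, float]]:
        # Placeholder: one (x, y) waypoint per imagined step; a real planner
        # would score candidate trajectories against the instruction.
        return [(float(obs["step"]), 0.0) for obs in imagined]


def dreamnav_step(instruction: str, raw_view: dict) -> List[Tuple[float, float]]:
    """One planning cycle: correct the view, imagine ahead, plan a trajectory."""
    corrector = EgoViewCorrector(tilt_offset_deg=raw_view.get("tilt_deg", 0.0))
    view = corrector.correct(raw_view)
    imagined = ImaginationPredictor().imagine(view)
    return TrajectoryPredictor().plan(instruction, imagined)


print(dreamnav_step("walk past the sofa and stop at the door", {"tilt_deg": 5.0}))
```

The point of the structure is the contrast the abstract draws: the planner consumes a sequence of imagined observations and returns a multi-waypoint trajectory, rather than choosing one next point from the current frame.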
Similar Papers
SoraNav: Adaptive UAV Task-Centric Navigation via Zeroshot VLM Reasoning
Robotics
Drones follow spoken directions in 3D spaces.
Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
Artificial Intelligence
Helps robots follow directions in new places.
VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation
Robotics
Helps robots learn new places without practice.