UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model
By: Changxin Huang, Lv Tang, Zhaohuan Zhan, and more
Potential Business Impact:
Helps robots understand where to go using sight and words.
Vision-and-Language Navigation (VLN), which requires agents to autonomously navigate complex environments from visual observations and natural-language instructions, remains highly challenging. Recent work that enhances language-guided navigation reasoning with pre-trained large language models (LLMs) has shown promise. However, the reasoning in such methods is confined to the linguistic modality and lacks visual reasoning capability. Moreover, existing reasoning modules are optimized separately from the navigation policy, leading to incompatibility and potential conflicts between optimization objectives. To tackle these challenges, we introduce UNeMo, a framework designed for the collaborative optimization of visual state reasoning and navigational decision-making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigation actions as inputs and jointly predicts the subsequent visual state, enabling cross-modal reasoning. Through a Hierarchical Prediction-Feedback (HPN) mechanism, the MWM collaborates with the navigation policy: the first stage generates an action from the current vision-and-language features; the MWM then infers the post-action visual state, which guides the second stage's fine-grained decision. This forms a dynamic bidirectional promotion loop in which MWM reasoning improves the navigation policy, while policy decisions feed back to improve the MWM's prediction accuracy. Experiments on the R2R and REVERIE datasets show that UNeMo outperforms state-of-the-art methods by 2.1% and 0.7% in navigation accuracy on unseen scenes, validating its effectiveness.
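To make the two-stage prediction-feedback idea concrete, here is a minimal PyTorch-style sketch. It is not the paper's implementation: all module names, feature dimensions, the concatenation-based fusion, and the six-way action space are illustrative assumptions; only the overall flow (stage-one action proposal, MWM imagining the post-action visual state, stage-two refinement on that prediction) follows the abstract.

```python
import torch
import torch.nn as nn

class MultimodalWorldModel(nn.Module):
    """Predicts the post-action visual state from vision, language, and action.
    Dimensions and the simple MLP fusion are assumptions, not the paper's design."""
    def __init__(self, vis_dim=768, lang_dim=768, act_dim=32, hidden=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(vis_dim + lang_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vis_dim),  # predicted next visual feature
        )

    def forward(self, vis, lang, act):
        return self.fuse(torch.cat([vis, lang, act], dim=-1))

class TwoStagePolicy(nn.Module):
    """Hierarchical prediction-feedback: a coarse action proposal, then a
    refined decision conditioned on the MWM's predicted next visual state."""
    def __init__(self, vis_dim=768, lang_dim=768, act_dim=32, n_actions=6):
        super().__init__()
        self.stage1 = nn.Linear(vis_dim + lang_dim, n_actions)
        self.act_embed = nn.Embedding(n_actions, act_dim)
        self.mwm = MultimodalWorldModel(vis_dim, lang_dim, act_dim)
        self.stage2 = nn.Linear(vis_dim + lang_dim + vis_dim, n_actions)

    def forward(self, vis, lang):
        # Stage 1: coarse action from current vision-and-language features.
        logits1 = self.stage1(torch.cat([vis, lang], dim=-1))
        act = self.act_embed(logits1.argmax(dim=-1))
        # MWM: infer the visual state the proposed action would lead to.
        pred_vis = self.mwm(vis, lang, act)
        # Stage 2: refine the decision using the imagined future state.
        logits2 = self.stage2(torch.cat([vis, lang, pred_vis], dim=-1))
        return logits1, logits2, pred_vis

# Usage: a batch of 4 agents at one decision step.
policy = TwoStagePolicy()
vis = torch.randn(4, 768)   # current visual features
lang = torch.randn(4, 768)  # pooled instruction embedding
logits1, logits2, pred_vis = policy(vis, lang)
print(logits2.shape)  # torch.Size([4, 6])
```

In this sketch the bidirectional promotion would come from joint training: a navigation loss on both stages' logits optimizes the policy, while a reconstruction loss between `pred_vis` and the actually observed next visual feature trains the MWM on the states the policy visits.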
Similar Papers
A Navigation Framework Utilizing Vision-Language Models
Robotics
Helps robots follow spoken directions in new places.
Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
Artificial Intelligence
Helps robots understand places better to find their way.
Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
Artificial Intelligence
Helps robots learn to navigate new places by imagining.