VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents
By: Xunyi Zhao, Gengze Zhou, Qi Wu
Potential Business Impact:
Helps robots understand and move in new places.
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a wide range of vision-language tasks. However, their performance as embodied agents, which requires multi-round dialogue, spatial reasoning, and sequential action prediction, remains underexplored. Our work investigates this potential in the context of Vision-and-Language Navigation (VLN) by introducing VLN-MME, a unified and extensible evaluation framework that probes MLLMs as zero-shot agents by bridging traditional navigation datasets into a standardized benchmark. We simplify evaluation with a highly modular and accessible design. This flexibility streamlines experiments, enabling structured comparisons and component-level ablations across diverse MLLM architectures, agent designs, and navigation tasks. Crucially, our framework reveals that enhancing the baseline agent with Chain-of-Thought (CoT) reasoning and self-reflection leads to an unexpected performance decrease. This suggests that MLLMs exhibit poor context awareness in embodied navigation tasks: although they can follow instructions and structure their output, their 3D spatial reasoning fidelity is low. VLN-MME lays the groundwork for systematic evaluation of general-purpose MLLMs in embodied navigation settings and reveals limitations in their sequential decision-making capabilities. We believe these findings offer crucial guidance for post-training MLLMs as embodied agents.
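The abstract's emphasis on a modular design with component-level ablations is easiest to picture as a common agent interface plus a shared evaluation loop, where swapping agent variants is a one-line change. The sketch below is an illustration only, not the actual VLN-MME code: `Observation`, `BaselineAgent`, `CoTAgent`, `evaluate`, and the `env.reset()`/`env.step()` episode interface are all hypothetical names standing in for whatever the framework actually exposes.

```python
# Hypothetical sketch of a modular zero-shot VLN evaluation loop.
# None of these names come from the VLN-MME paper; they illustrate
# how component-level ablations (e.g., baseline vs. CoT agent) can
# share one loop.
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class Observation:
    """One navigation step: the instruction plus the current views."""
    instruction: str
    view_descriptions: list[str]   # e.g., per-heading captions or image tokens
    candidate_actions: list[str]   # discrete choices such as headings or "stop"

class Agent(Protocol):
    def act(self, obs: Observation, history: list[str]) -> str: ...

class BaselineAgent:
    """Directly asks the MLLM to pick one candidate action."""
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm

    def act(self, obs: Observation, history: list[str]) -> str:
        prompt = (
            f"Instruction: {obs.instruction}\n"
            f"Action history: {history}\n"
            f"Views: {obs.view_descriptions}\n"
            f"Choose exactly one action from {obs.candidate_actions}."
        )
        return self.llm(prompt)

class CoTAgent(BaselineAgent):
    """Same interface, but requests step-by-step reasoning first --
    the kind of ablation the paper reports as hurting performance."""
    def act(self, obs: Observation, history: list[str]) -> str:
        prompt = (
            f"Instruction: {obs.instruction}\n"
            f"Action history: {history}\n"
            f"Views: {obs.view_descriptions}\n"
            "Reason step by step about your location, then output one "
            f"action from {obs.candidate_actions} on the final line."
        )
        # Keep only the final line as the chosen action.
        return self.llm(prompt).splitlines()[-1].strip()

def evaluate(agent: Agent, episodes, max_steps: int = 15) -> float:
    """Run each episode to termination and report the success rate."""
    successes = 0
    for env in episodes:          # env: hypothetical episode wrapper
        obs, history = env.reset(), []
        for _ in range(max_steps):
            action = agent.act(obs, history)
            history.append(action)
            obs, done, success = env.step(action)
            if done:
                successes += int(success)
                break
    return successes / len(episodes)
```

Under this framing, comparing `evaluate(BaselineAgent(llm), episodes)` against `evaluate(CoTAgent(llm), episodes)` is the shape of the component-level ablation behind the paper's finding that CoT and self-reflection unexpectedly degrade navigation performance.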
Similar Papers
Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
Artificial Intelligence
Helps robots understand places better to find their way.
City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs
Computer Vision and Pattern Recognition
Helps robots find their way using only their eyes.
Think, Remember, Navigate: Zero-Shot Object-Goal Navigation with VLM-Powered Reasoning
Robotics
Helps robots explore new places much faster.