Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
By: Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, and more
Potential Business Impact:
Helps robots navigate better with two eyes.
The success of foundation models in language and vision has motivated research into fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, this approach requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, and where the depth-scale ambiguity of monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art when trained on the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
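To make the stereo intuition concrete, the sketch below illustrates the standard geometric fact the abstract relies on: with a rectified stereo pair and a known focal length and baseline, per-pixel disparity converts directly to metric depth, which a monocular view cannot recover up to scale. This is a generic illustration, not StereoWalker's implementation; the function name and camera parameters are assumptions chosen for the example.

```python
# Minimal sketch: why stereo resolves the depth-scale ambiguity.
# Given rectified stereo with known focal length (pixels) and baseline (meters),
# depth = focal_px * baseline_m / disparity. All values below are illustrative.

import numpy as np

def disparity_to_metric_depth(disparity: np.ndarray,
                              focal_px: float,
                              baseline_m: float,
                              eps: float = 1e-6) -> np.ndarray:
    """Convert a disparity map (in pixels) to metric depth (in meters).

    Pixels with near-zero disparity are assigned infinite depth.
    """
    depth = np.full_like(disparity, np.inf, dtype=np.float64)
    valid = disparity > eps
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Assumed camera: 700 px focal length, 12 cm stereo baseline.
disparity = np.array([[35.0, 7.0],
                      [0.0, 14.0]])
print(disparity_to_metric_depth(disparity, focal_px=700.0, baseline_m=0.12))
# [[ 2.4 12. ]
#  [ inf  6. ]]  -> metric depths in meters
```

A monocular depth estimator can produce the same relative structure, but the absolute scale (2.4 m vs. 12 m here) is only determined once the baseline is known, which is the ambiguity stereo input removes.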
Similar Papers
MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming
CV and Pattern Recognition
Helps robots navigate using just a single camera.
SLAM-Free Visual Navigation with Hierarchical Vision-Language Perception and Coarse-to-Fine Semantic Topological Planning
Robotics
Robots learn to explore using words and pictures.
Fast-SmartWay: Panoramic-Free End-to-End Zero-Shot Vision-and-Language Navigation
Robotics
Helps robots navigate using only a few pictures.