VIMD: Monocular Visual-Inertial Motion and Depth Estimation
By: Saimouli Katragadda, Guoquan Huang
Potential Business Impact:
Helps robots see in 3D with just one camera.
Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework that estimates dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. At its core, the proposed VIMD exploits multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness even with extremely sparse input, as few as 10-20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource-constrained settings, while its robust performance and strong generalization offer significant potential across a wide range of scenarios.
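To make the contrast between the two scaling strategies concrete, the toy sketch below compares a global affine fit against a simple per-pixel scale refinement seeded from sparse metric anchors (e.g., depths of visual-inertially tracked landmarks). It is an illustrative approximation only: the diffusion-style propagation and the function names (`global_affine_fit`, `per_pixel_scale_refine`) are our assumptions, not the learned multi-view refinement used in VIMD.

```python
import numpy as np

def global_affine_fit(d_rel, uv, z_metric):
    """Fit one scale/shift pair (a, b) so that a * d_rel + b matches the sparse
    metric anchors, as in prior affine-invariant depth alignment."""
    d = d_rel[uv[:, 1], uv[:, 0]]                    # relative depth at anchor pixels
    A = np.stack([d, np.ones_like(d)], axis=1)       # least-squares design matrix [d, 1]
    (a, b), *_ = np.linalg.lstsq(A, z_metric, rcond=None)
    return a * d_rel + b

def per_pixel_scale_refine(d_rel, uv, z_metric, iters=200):
    """Seed a per-pixel scale map at the sparse anchors and iteratively propagate
    it to neighbouring pixels (simple diffusion as a stand-in for a learned module)."""
    h, w = d_rel.shape
    scale = np.ones((h, w))
    anchors = np.zeros((h, w), dtype=bool)
    rows, cols = uv[:, 1], uv[:, 0]
    scale[rows, cols] = z_metric / np.maximum(d_rel[rows, cols], 1e-6)
    anchors[rows, cols] = True
    for _ in range(iters):
        # replace non-anchor pixels with the mean of their 4-neighbourhood
        nbr = (np.roll(scale, 1, 0) + np.roll(scale, -1, 0)
               + np.roll(scale, 1, 1) + np.roll(scale, -1, 1)) / 4.0
        scale = np.where(anchors, scale, nbr)
    return scale * d_rel

# Tiny synthetic example: 10 anchor points on a 64x64 relative-depth map.
rng = np.random.default_rng(0)
d_rel = rng.uniform(0.2, 1.0, size=(64, 64))
uv = rng.integers(0, 64, size=(10, 2))               # (col, row) anchor coordinates
z_metric = 3.0 * d_rel[uv[:, 1], uv[:, 0]]           # pretend true metric depth is 3x relative
print(global_affine_fit(d_rel, uv, z_metric).mean())
print(per_pixel_scale_refine(d_rel, uv, z_metric).mean())
```

With only a handful of anchors, the global fit can correct a single scale and shift, whereas a per-pixel scale map can absorb spatially varying error, which is the intuition behind the per-pixel refinement described in the abstract.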
Similar Papers
Zero-Shot Metric Depth Estimation via Monocular Visual-Inertial Rescaling for Autonomous Aerial Navigation
Robotics
Helps drones see how far things are.
Vision-Language Embodiment for Monocular Depth Estimation
CV and Pattern Recognition
Helps robots see in 3D using just one camera.
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
CV and Pattern Recognition
Makes 3D pictures from many photos.