Vision-Language Embodiment for Monocular Depth Estimation
By: Jinchang Zhang, Guoyu Lu
Potential Business Impact:
Helps robots see in 3D using just one camera.
Depth estimation is a core problem in robotic perception and vision tasks, but 3D reconstruction from a single image carries inherent uncertainty. Current depth estimation models rely primarily on inter-image relationships for supervised training, often overlooking the intrinsic information provided by the camera itself. We propose a method that embodies the camera model and its physical characteristics in a deep learning model, computing embodied scene depth through real-time interaction with road environments. Using only the camera's intrinsic properties, without any additional equipment, the model computes embodied scene depth in real time as the environment changes. By combining embodied scene depth with RGB image features, the model gains a comprehensive perspective on both geometric and visual details. Additionally, we incorporate text descriptions containing environmental content and depth information as priors for scene understanding, enriching the model's perception of objects. This integration of image and language, two inherently ambiguous modalities, leverages their complementary strengths for monocular depth estimation. The real-time nature of the embodied language and depth prior model ensures that it can continuously adjust its perception and behavior in dynamic environments. Experimental results show that the embodied depth estimation method enhances model performance across different scenes.
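The geometric idea behind a camera-only depth prior for road scenes can be illustrated with a flat-ground assumption: each pixel's viewing ray is intersected with the ground plane, giving a dense depth map from the intrinsics alone. The sketch below is a minimal, hypothetical illustration, not the paper's implementation; the function name `ground_plane_depth` and the numeric values (focal lengths, principal point, camera height, pitch) are assumed placeholders.

```python
import numpy as np

def ground_plane_depth(h, w, fx, fy, cx, cy, cam_height=1.5, pitch=0.0):
    """Depth map (metres) assuming every pixel sees a flat ground plane.

    Camera frame: x right, y down, z forward; the camera sits cam_height
    metres above the plane, and positive pitch tilts the view downward.
    """
    uu, vv = np.meshgrid(np.arange(w), np.arange(h))
    # Ray direction through each pixel under the pinhole model (z = 1).
    x = (uu - cx) / fx
    y = (vv - cy) / fy
    # Downward (y) component of each ray after rotating by the pitch
    # about the x-axis; the ray's z-component is 1 before rotation.
    c, s = np.cos(pitch), np.sin(pitch)
    y_world = c * y + s * 1.0
    # Intersect the ray with the plane y = cam_height; rays at or above
    # the horizon (y_world <= 0) never hit the ground and get inf depth.
    with np.errstate(divide="ignore"):
        t = np.where(y_world > 1e-6, cam_height / y_world, np.inf)
    # The intersection point is t * (x, y, 1) in camera coordinates, so
    # its depth along the optical axis is simply t.
    return t

# Example: a 480x640 image with hypothetical KITTI-like parameters.
prior = ground_plane_depth(480, 640, fx=721.5, fy=721.5,
                           cx=320.0, cy=240.0,
                           cam_height=1.65, pitch=0.03)
```

Under this assumption the prior is exact for ground pixels and only approximate elsewhere, which is why the method fuses it with learned RGB features and text priors rather than using it alone.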
Similar Papers
Depth as Points: Center Point-based Depth Estimation
CV and Pattern Recognition
Helps self-driving cars see better and faster.
Multi-view Reconstruction via SfM-guided Monocular Depth Estimation
CV and Pattern Recognition
Makes 3D pictures from many photos.
Dense Geometry Supervision for Underwater Depth Estimation
CV and Pattern Recognition
Helps cameras see clearly underwater.