
Scale-invariant and View-relational Representation Learning for Full Surround Monocular Depth

Published: December 9, 2025 | arXiv ID: 2512.08700v1

By: Kyumin Hwang, Wonhyeok Choi, Kiljoon Han and more

Recent foundation models demonstrate strong generalization capabilities in monocular depth estimation. However, directly applying these models to Full Surround Monocular Depth Estimation (FSMDE) presents two major challenges: (1) high computational cost, which limits real-time performance, and (2) difficulty in estimating metric-scale depth, as these models are typically trained to predict only relative depth. To address these limitations, we propose a novel knowledge distillation strategy that transfers robust depth knowledge from a foundation model to a lightweight FSMDE network. Our approach leverages a hybrid regression framework combining the knowledge distillation scheme--traditionally used in classification--with a depth binning module to enhance scale consistency. Specifically, we introduce a cross-interaction knowledge distillation scheme that distills the scale-invariant depth bin probabilities of a foundation model into the student network while guiding it to infer metric-scale depth bin centers from ground-truth depth. Furthermore, we propose view-relational knowledge distillation, which encodes structural relationships among adjacent camera views and transfers them to enhance cross-view depth consistency. Experiments on DDAD and nuScenes demonstrate the effectiveness of our method compared to conventional supervised methods and existing knowledge distillation approaches. Moreover, our method achieves a favorable trade-off between performance and efficiency, meeting real-time requirements.
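The hybrid regression idea in the abstract (depth bin probabilities combined with metric-scale bin centers) can be illustrated with a minimal single-pixel sketch. This is not the authors' implementation: the abstract does not state the loss or the network details, so the KL-divergence distillation term, the function names, and the bin-center values below are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Turn per-bin logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def depth_from_bins(bin_probs, bin_centers):
    """Hybrid regression: depth is the probability-weighted sum of bin centers."""
    return sum(p * c for p, c in zip(bin_probs, bin_centers))

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over depth bins (assumed distillation loss)."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

# Hypothetical per-pixel logits over 4 depth bins.
teacher_probs = softmax([0.1, 2.0, 0.5, -1.0])  # scale-invariant teacher output
student_probs = softmax([0.0, 1.5, 1.0, -0.5])  # lightweight student output

# Hypothetical metric bin centers in meters; in the paper these are
# inferred by the student under ground-truth depth supervision.
bin_centers = [2.0, 10.0, 30.0, 80.0]

depth = depth_from_bins(student_probs, bin_centers)
distill_loss = kl_divergence(teacher_probs, student_probs)
print(f"predicted depth: {depth:.2f} m, distillation loss: {distill_loss:.4f}")
```

The split mirrors the description above: the bin probabilities carry no metric scale, so they can be matched to the teacher's output, while the bin centers carry the metric scale and are learned from ground truth.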

Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision

CV and Pattern Recognition

Makes 3D pictures more real, near and far.

13 Nov 2025 0

90%

EndoUFM: Utilizing Foundation Models for Monocular depth estimation of endoscopic images

CV and Pattern Recognition

Helps doctors see inside bodies better.

25 Aug 2025 1

90%

CylinderDepth: Cylindrical Spatial Attention for Multi-View Consistent Self-Supervised Surround Depth Estimation

CV and Pattern Recognition

Makes 3D pictures from many cameras match perfectly.

20 Nov 2025 1
