Visual Autoregressive Modelling for Monocular Depth Estimation
By: Amir El-Ghoussani, André Kaup, Nassir Navab and more
Potential Business Impact:
Helps computers guess how far away things are.
We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Inference runs in ten fixed autoregressive stages, fine-tuning requires only 74K synthetic samples, and the method achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance on outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.
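To make the phrase "classifier-free guidance over ten fixed stages" concrete, here is a minimal sketch of the standard guidance rule applied once per stage. It is not taken from the authors' code: the names `cfg_combine`, `run_stages`, and `toy_predict`, the guidance scale of 2.0, and the two-element logit vectors are all assumptions made for illustration, and the real model predicts token maps at increasing resolutions rather than short lists.

```python
from typing import Callable, List, Optional

Logits = List[float]


def cfg_combine(cond: Logits, uncond: Logits, guidance_scale: float) -> Logits:
    """Standard classifier-free guidance: move the unconditional
    prediction toward the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def run_stages(
    predict: Callable[[int, Optional[str]], Logits],
    condition: str,
    num_stages: int = 10,          # the abstract states ten fixed stages
    guidance_scale: float = 2.0,   # assumed value, not from the paper
) -> List[Logits]:
    """Run a fixed number of stages, applying guidance at each one."""
    outputs = []
    for stage in range(num_stages):
        cond = predict(stage, condition)   # prediction given the condition
        uncond = predict(stage, None)      # prediction with condition dropped
        outputs.append(cfg_combine(cond, uncond, guidance_scale))
    return outputs


def toy_predict(stage: int, condition: Optional[str]) -> Logits:
    """Hypothetical stand-in for the model: the condition raises the
    first logit by 1.0 at every stage."""
    base = float(stage)
    return [base + 1.0, base] if condition is not None else [base, base]


guided = run_stages(toy_predict, condition="rgb")
```

With a guidance scale above 1, the guided logits overshoot the conditional prediction, which is the usual way guidance strengthens the influence of the conditioning signal.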
Similar Papers
Visual Autoregressive Modeling for Instruction-Guided Image Editing
CV and Pattern Recognition
Edits pictures perfectly, following your exact words.
Seg-VAR: Image Segmentation with Visual Autoregressive Modeling
CV and Pattern Recognition
Makes computers perfectly outline any object in pictures.
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation
CV and Pattern Recognition
Cleans up dark, blurry, noisy pictures automatically.