Score: 2

Visual Autoregressive Modelling for Monocular Depth Estimation

Published: December 27, 2025 | arXiv ID: 2512.22653v1

By: Amir El-Ghoussani, André Kaup, Nassir Navab and more

Potential Business Impact:

Helps computers guess how far away things are.

Business Areas:

Image Recognition, Data and Analytics, Software

We propose a monocular depth estimation method based on visual autoregressive (VAR) priors, offering an alternative to diffusion-based approaches. Our method adapts a large-scale text-to-image VAR model and introduces a scale-wise conditional upsampling mechanism with classifier-free guidance. Inference runs in ten fixed autoregressive stages, fine-tuning requires only 74K synthetic samples, and the method achieves competitive results. We report state-of-the-art performance on indoor benchmarks under constrained training conditions, and strong performance on outdoor datasets. This work establishes autoregressive priors as a complementary family of geometry-aware generative models for depth estimation, highlighting advantages in data scalability and adaptability to 3D vision tasks. Code is available at https://github.com/AmirMaEl/VAR-Depth.
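The combination of coarse-to-fine stages and classifier-free guidance mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy predictor, the ten-entry scale schedule, the guidance weight, and the guidance convention (unconditional logits plus a weighted difference, so a weight of 1 gives the purely conditional prediction) are all assumptions made for illustration. A real VAR model predicts every token of a stage in one parallel forward pass; the inner loop here only stands in for that.

```python
from typing import Callable, List, Optional, Sequence

Logits = List[float]
TokenMap = List[int]

# Hypothetical ten-stage schedule of token-map side lengths (coarse to fine).
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def cfg_combine(cond: Logits, uncond: Logits, guidance_scale: float) -> Logits:
    """Classifier-free guidance: push unconditional logits toward conditional ones."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def coarse_to_fine_decode(
    predict: Callable[[List[TokenMap], int, int, Optional[Sequence[int]]], Logits],
    scales: Sequence[int],
    guidance_scale: float,
    condition: Sequence[int],
) -> List[TokenMap]:
    """Decode one token map per scale; each stage sees all coarser stages."""
    token_maps: List[TokenMap] = []
    for side in scales:
        stage: TokenMap = []
        for position in range(side * side):
            # Tokens within a stage depend only on coarser stages, not on each other.
            cond = predict(token_maps, side, position, condition)
            uncond = predict(token_maps, side, position, None)  # condition dropped
            stage.append(argmax(cfg_combine(cond, uncond, guidance_scale)))
        token_maps.append(stage)
    return token_maps


def toy_predict(previous, side, position, condition):
    """Stand-in for a network: a stage-dependent prior plus a condition bonus."""
    logits = [0.0, 0.0, 0.0, 0.0]  # hypothetical vocabulary of 4 tokens
    logits[len(previous) % 4] += 1.0
    if condition is not None:
        logits[condition[position % len(condition)] % 4] += 2.0
    return logits


token_maps = coarse_to_fine_decode(
    toy_predict, SCALES, guidance_scale=1.5, condition=[0, 1, 2, 3]
)
print([len(stage) for stage in token_maps])
```

In this toy setup the guided logits favour the token suggested by the condition, so each stage simply echoes the conditioning pattern; with a trained model the same loop structure would instead refine a depth token map stage by stage.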