QDepth-VLA: Quantized Depth Prediction as Auxiliary Supervision for Vision-Language-Action Models
By: Yixuan Li, Yuhui Chen, Mingcai Zhou, and more
Potential Business Impact:
Helps robots understand 3D space so they can perform tasks more precisely.
Spatial perception and reasoning are crucial for Vision-Language-Action (VLA) models to accomplish fine-grained manipulation tasks. However, existing approaches often lack the ability to understand and reason over the essential 3D structures necessary for precise control. To address this limitation, we propose QDepth-VLA, a general framework that augments VLA models with an auxiliary depth prediction task. A dedicated depth expert is designed to predict quantized latent tokens of depth maps obtained from a VQ-VAE encoder, enabling the model to learn depth-aware representations that capture critical geometric cues. Experimental results on simulation benchmarks and real-world tasks demonstrate that QDepth-VLA yields strong spatial reasoning and competitive performance on manipulation tasks.
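The sketch below illustrates the general idea described in the abstract: a VQ-VAE-style tokenizer turns depth maps into discrete latent tokens, and an auxiliary "depth expert" head is trained to predict those tokens from the VLA model's visual features via cross-entropy. This is a minimal illustration only; all module names, layer sizes, codebook dimensions, and the loss formulation are assumptions for the sketch and are not taken from the paper.

```python
# Minimal sketch of auxiliary depth-token supervision (hypothetical details).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup, VQ-VAE style (assumed sizes)."""

    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):  # z: (B, code_dim, H, W)
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)      # (B*H*W, code_dim)
        dist = torch.cdist(flat, self.codebook.weight)   # distance to each code
        indices = dist.argmin(dim=1)                     # discrete token ids
        return indices.view(b, h, w)


class DepthTokenizer(nn.Module):
    """Encodes a depth map into a grid of discrete latent tokens."""

    def __init__(self, code_dim=64, num_codes=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, code_dim, 4, stride=2, padding=1),
        )
        self.quantizer = VectorQuantizer(num_codes, code_dim)

    @torch.no_grad()  # treated as a fixed target generator here
    def forward(self, depth):  # depth: (B, 1, H, W)
        return self.quantizer(self.encoder(depth))       # (B, H/4, W/4) token ids


class DepthExpertHead(nn.Module):
    """Auxiliary head predicting depth-token logits from VLA visual features."""

    def __init__(self, feat_dim=256, num_codes=512):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim, num_codes, kernel_size=1)

    def forward(self, feats):  # feats: (B, feat_dim, h, w)
        return self.proj(feats)                           # (B, num_codes, h, w)


def auxiliary_depth_loss(feats, depth, tokenizer, head):
    """Cross-entropy between predicted logits and the VQ-VAE depth tokens."""
    target_tokens = tokenizer(depth)                      # (B, h', w')
    logits = head(feats)                                  # (B, K, h, w)
    logits = F.interpolate(logits, size=target_tokens.shape[-2:], mode="bilinear")
    return F.cross_entropy(logits, target_tokens)


if __name__ == "__main__":
    depth = torch.rand(2, 1, 64, 64)   # dummy depth maps
    feats = torch.rand(2, 256, 8, 8)   # dummy VLA visual features
    loss = auxiliary_depth_loss(feats, depth, DepthTokenizer(), DepthExpertHead())
    print(loss.item())                 # would be added to the main action loss
```

In this reading, the auxiliary loss is simply summed with the policy's action-prediction objective during training, encouraging the shared visual features to encode the geometric cues captured by the depth tokens.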
Similar Papers
DepthVLA: Enhancing Vision-Language-Action Models with Depth-Aware Spatial Reasoning
CV and Pattern Recognition
Helps robots better understand where things are.
SQAP-VLA: A Synergistic Quantization-Aware Pruning Framework for High-Performance Vision-Language-Action Models
CV and Pattern Recognition
Makes smart robots run faster and use less power.
GeoVLA: Empowering 3D Representations in Vision-Language-Action Models
Robotics
Helps robots understand 3D space to do tasks better.