Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis
By: Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo and more
Potential Business Impact:
Tests how well computers understand 3D objects from pictures.
Benchmarking the 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images of objects captured from specific viewpoints (keys), we benchmark the performance of segmenting novel views (queries) and report scores across four difficulty categories (easy, medium, hard, and extreme) defined by the key-query viewpoint contrast. We benchmark 8 state-of-the-art foundation models and show that DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.
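To make the in-context evaluation concrete, the sketch below shows the general idea of finetuning-free segmentation by nearest-neighbor label transfer between dense patch features of key and query views, in the spirit of the Hummingbird-style setup described in the abstract. The function names, tensor shapes, and the commented DINOv2 loading line are illustrative assumptions, not the benchmark's exact implementation.

```python
# Minimal sketch of in-context segmentation via nearest-neighbor label transfer.
# All names and shapes here are assumptions for illustration only.
import torch
import torch.nn.functional as F

def patch_features(encoder, images):
    """Return L2-normalized dense patch features, shape (B, N_patches, D)."""
    with torch.no_grad():
        feats = encoder.forward_features(images)["x_norm_patchtokens"]
    return F.normalize(feats, dim=-1)

def segment_queries(encoder, key_imgs, key_patch_labels, query_imgs, k=5):
    """Label each query patch by majority vote over its k nearest key patches.

    key_patch_labels holds one class id per key patch (flattened to match the
    patch grid); the output is a per-patch class map for each query image.
    """
    key_feats = patch_features(encoder, key_imgs).flatten(0, 1)   # (K*N, D)
    key_labels = key_patch_labels.flatten()                        # (K*N,)
    query_feats = patch_features(encoder, query_imgs)              # (Q, N, D)

    preds = []
    for qf in query_feats:                                         # (N, D)
        sim = qf @ key_feats.T                                     # cosine similarity, (N, K*N)
        nn_idx = sim.topk(k, dim=-1).indices                       # (N, k)
        nn_labels = key_labels[nn_idx]                             # (N, k)
        preds.append(nn_labels.mode(dim=-1).values)                # majority vote per patch
    return torch.stack(preds)                                      # (Q, N)

# Hypothetical usage: keys are annotated views of an object, queries are novel
# views whose viewpoint contrast determines the easy/medium/hard/extreme bin.
# encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
# pred_patch_maps = segment_queries(encoder, key_imgs, key_patch_labels, query_imgs)
```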
Similar Papers
Emergent Extreme-View Geometry in 3D Foundation Models
CV and Pattern Recognition
Makes 3D pictures work even with weird camera angles.
Finding 3D Scene Analogies with Multimodal Foundation Models
CV and Pattern Recognition
Robots learn new places by comparing them to old ones.
HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model
CV and Pattern Recognition
Helps computers understand 3D spaces from pictures and words.