Faster VGGT with Block-Sparse Global Attention

Published: September 8, 2025 | arXiv ID: 2509.07120v1

By: Chung-Shien Brian Wang, Christian Schmidt, Jens Piekenbrinck, and more

Potential Business Impact:

Makes building 3D models from collections of pictures much faster.

Business Areas:
Image Recognition Data and Analytics, Software

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $\pi^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck due to the quadratic complexity of their global attention layers, which limits scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structured attention and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $\pi^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
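To make the core idea concrete, here is a minimal NumPy sketch of block-sparse attention: scores are pooled into block-by-block tiles, only the top-scoring key blocks per query block are kept, and the rest are masked out before the softmax. This is an illustrative toy, not the paper's method — the block-selection rule, block size, and `keep_ratio` here are assumptions, and the paper uses optimized GPU kernels rather than a dense mask.

```python
import numpy as np

def softmax(x):
    m = np.max(x, axis=-1, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=-1, keepdims=True)

def dense_attention(Q, K, V):
    # Standard scaled dot-product attention: O(n^2) in the number of tokens.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def block_sparse_attention(Q, K, V, block=4, keep_ratio=0.5):
    # Toy block-sparse attention: evaluate attention only on a subset of
    # (query block, key block) tiles, chosen here by pooled score magnitude.
    # Assumes n is divisible by `block`.
    n, d = Q.shape
    nb = n // block
    scores = Q @ K.T / np.sqrt(d)
    # Pool scores into nb x nb tiles (a stand-in for whatever block-importance
    # estimate a real system would use).
    pooled = scores.reshape(nb, block, nb, block).mean(axis=(1, 3))
    k = max(1, int(round(keep_ratio * nb)))
    keep = np.argsort(pooled, axis=-1)[:, -k:]  # top-k key blocks per query block
    # Build an additive mask: 0 on kept tiles, -inf elsewhere.
    mask = np.full((nb, nb), -np.inf)
    rows = np.repeat(np.arange(nb), k)
    mask[rows, keep.ravel()] = 0.0
    full_mask = np.kron(mask, np.ones((block, block)))
    return softmax(scores + full_mask) @ V
```

With `keep_ratio=1.0` every tile is kept and the result matches dense attention exactly; shrinking `keep_ratio` drops low-mass tiles, which is where a fused block-sparse kernel (unlike this dense-mask toy) saves both compute and memory.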

Page Count
29 pages

Category
Computer Science:
CV and Pattern Recognition