SwiftVGGT: A Scalable Visual Geometry Grounded Transformer for Large-Scale Scenes
By: Jungho Lee, Minhyeok Lee, Sunghun Yang, and more
Potential Business Impact:
Builds detailed 3D maps much faster.
3D reconstruction in large-scale scenes is a fundamental task in 3D perception, but the inherent trade-off between accuracy and computational efficiency remains a significant challenge. Existing methods either prioritize speed and produce low-quality results, or achieve high-quality reconstruction at the cost of slow inference. In this paper, we propose SwiftVGGT, a training-free method that significantly reduces inference time while preserving high-quality dense 3D reconstruction. To maintain global consistency in large-scale scenes, SwiftVGGT performs loop closure without relying on an external Visual Place Recognition (VPR) model. This removes redundant computation and enables accurate reconstruction over kilometer-scale environments. Furthermore, we propose a simple yet effective point sampling method that aligns neighboring chunks with a single Sim(3)-based Singular Value Decomposition (SVD) step. This eliminates the Iteratively Reweighted Least Squares (IRLS) optimization commonly used in prior work, leading to substantial speed-ups. We evaluate SwiftVGGT on multiple datasets and show that it achieves state-of-the-art reconstruction quality while requiring only 33% of the inference time of recent VGGT-based large-scale reconstruction approaches.
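The single-SVD Sim(3) alignment the abstract describes matches the classical closed-form Umeyama solution: given sampled point correspondences between two overlapping chunks, scale, rotation, and translation all fall out of one SVD of the cross-covariance matrix. Below is a minimal NumPy sketch of that step; the function name and interface are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sim3_umeyama(src: np.ndarray, dst: np.ndarray):
    """Closed-form Sim(3) alignment (Umeyama, 1991) via a single SVD.

    Estimates scale s, rotation R, and translation t such that
    dst ~= s * R @ src + t, given corresponding 3D points src and dst,
    each of shape (N, 3). Hypothetical sketch, not the paper's code.
    """
    mu_src = src.mean(axis=0)
    mu_dst = dst.mean(axis=0)
    src_c = src - mu_src
    dst_c = dst - mu_dst

    # 3x3 cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / src.shape[0]

    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard: keep R a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / src.shape[0]
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t
```

Applying the estimated (s, R, t) to every point of one chunk snaps it onto its neighbor in a single shot, which is why no iterative reweighting (IRLS) loop is needed once the sampled correspondences are reliable.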
Similar Papers
FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention
CV and Pattern Recognition
Makes 3D pictures from many photos faster.
Building temporally coherent 3D maps with VGGT for memory-efficient Semantic SLAM
CV and Pattern Recognition
Helps robots see and understand moving things.
LiteVGGT: Boosting Vanilla VGGT via Geometry-aware Cached Token Merging
CV and Pattern Recognition
Makes 3D pictures from many photos faster.