Taking Language Embedded 3D Gaussian Splatting into the Wild

Published: July 26, 2025 | arXiv ID: 2507.19830v2

By: Yuze Wang, Yue Qi

Potential Business Impact:

Lets computers understand building styles from photos.

Plain English Summary

Imagine being able to explore famous buildings virtually, not just by looking at pictures, but by actually walking through them and understanding how they're built. This new method uses lots of online photos to create detailed 3D models of buildings that you can interact with. This means you could ask questions like "show me all the arches" and the computer would highlight them, making it easier to learn about architecture or even design new buildings.

Recent advances in leveraging large-scale Internet photo collections for 3D reconstruction have enabled immersive virtual exploration of landmarks and historic sites worldwide. However, little attention has been given to the immersive understanding of architectural styles and structural knowledge, which remains largely confined to browsing static text-image pairs. Therefore, can we draw inspiration from 3D in-the-wild reconstruction techniques and use unconstrained photo collections to create an immersive approach for understanding the 3D structure of architectural components? To this end, we extend language embedded 3D Gaussian splatting (3DGS) and propose a novel framework for open-vocabulary scene understanding from unconstrained photo collections. Specifically, we first render multiple appearance images from the same viewpoint as the unconstrained image with the reconstructed radiance field, then extract multi-appearance CLIP features and two types of language feature uncertainty maps (transient and appearance uncertainty) derived from the multi-appearance features to guide the subsequent optimization process. Next, we propose a transient uncertainty-aware autoencoder, a multi-appearance language field 3DGS representation, and a post-ensemble strategy to effectively compress, learn, and fuse language features from multiple appearances. Finally, to quantitatively evaluate our method, we introduce PT-OVS, a new benchmark dataset for assessing open-vocabulary segmentation performance on unconstrained photo collections. Experimental results show that our method outperforms existing methods, delivering accurate open-vocabulary segmentation and enabling applications such as interactive roaming with open-vocabulary queries, architectural style pattern recognition, and 3D scene editing.
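The abstract's two uncertainty maps can be illustrated with a small sketch. This is a hypothetical interpretation, not the paper's implementation: it assumes per-pixel CLIP feature maps have already been extracted, treats appearance uncertainty as per-pixel feature variance across the N appearance renderings, and treats transient uncertainty as cosine disagreement between the unconstrained photo's features and the renderings' consensus (transient occluders such as pedestrians would disagree with the consensus). The function name and exact formulas are assumptions for illustration.

```python
import numpy as np

def feature_uncertainty_maps(multi_app_feats, observed_feats):
    """Hypothetical sketch of the two uncertainty maps.

    multi_app_feats: (N, H, W, D) per-pixel CLIP features from N appearance
                     renderings of the same viewpoint.
    observed_feats:  (H, W, D) per-pixel CLIP features from the
                     unconstrained photo itself.
    Returns (appearance_uncertainty, transient_uncertainty), each (H, W).
    """
    # Normalize features to unit length so cosine similarity is a dot product.
    def _norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    f = _norm(multi_app_feats)
    g = _norm(observed_feats)

    # Appearance uncertainty: how much the feature at each pixel varies
    # across the N appearance renderings (mean squared deviation of the
    # unit features from their per-pixel mean).
    mean_f = f.mean(axis=0)
    appearance_unc = ((f - mean_f) ** 2).sum(axis=-1).mean(axis=0)

    # Transient uncertainty: disagreement between the observed photo's
    # features and the renderings' consensus direction; 0 means perfect
    # agreement, larger values suggest a transient occluder at that pixel.
    consensus = _norm(mean_f)
    transient_unc = 1.0 - (g * consensus).sum(axis=-1)

    return appearance_unc, transient_unc
```

In the paper these maps then guide the optimization (down-weighting unreliable pixels); here they are simply returned for inspection.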

Country of Origin
🇨🇳 China

Page Count
16 pages

Category
Computer Science:
Graphics