SceneSplat++: A Large Dataset and Comprehensive Benchmark for Language Gaussian Splatting
By: Mengjiao Ma, Qi Ma, Yue Li and more
Potential Business Impact:
Teaches computers to understand 3D worlds better.
3D Gaussian Splatting (3DGS) serves as a highly performant and efficient encoding of scene geometry, appearance, and semantics. Moreover, grounding language in 3D scenes has proven to be an effective strategy for 3D scene understanding. Current Language Gaussian Splatting methods fall into three main groups: (i) per-scene optimization-based, (ii) per-scene optimization-free, and (iii) generalizable approaches. However, most of them are evaluated only on rendered 2D views of a handful of scenes and from viewpoints close to the training views, which limits insight into holistic 3D understanding. To address this gap, we propose the first large-scale benchmark that systematically assesses these three groups of methods directly in 3D space, evaluating on 1060 scenes across three indoor datasets and one outdoor dataset. Benchmark results demonstrate a clear advantage of the generalizable paradigm, particularly in relaxing the scene-specific limitation, enabling fast feed-forward inference on novel scenes, and achieving superior segmentation performance. We further introduce GaussianWorld-49K, a carefully curated 3DGS dataset comprising around 49K diverse indoor and outdoor scenes obtained from multiple sources, with which we demonstrate that the generalizable approach can harness strong data priors. Our code, benchmark, and datasets will be made public to accelerate research in generalizable 3DGS scene understanding.
Similar Papers
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
CV and Pattern Recognition
Teaches computers to understand 3D spaces from scans.
GSsplat: Generalizable Semantic Gaussian Splatting for Novel-view Synthesis in 3D Scenes
Graphics
Makes 3D scenes understandable from many angles.
SplatTalk: 3D VQA with Gaussian Splatting
CV and Pattern Recognition
Lets computers understand 3D worlds from pictures.