Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
By: Liwei Liao, Xufeng Li, Xiaoyun Zheng, and more
Potential Business Impact:
Find objects in 3D worlds with just words.
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods face two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that recasts 3DVG as a 2D retrieval task. GVR leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.
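To make the view-retrieval idea concrete, below is a minimal sketch of the 2D retrieval step under stated assumptions: pre-rendered 3DGS views are ranked against a text prompt with OpenAI's CLIP. The file names, the `retrieve_views` helper, and the use of CLIP specifically are illustrative assumptions, not GVR's actual implementation; the paper's object-level retrieval and the lifting of 2D clues back to 3D Gaussians are omitted.

```python
# Hypothetical sketch: rank rendered 3DGS views by text-image similarity
# using OpenAI CLIP (https://github.com/openai/CLIP). This illustrates the
# "2D retrieval" framing only; it is not the authors' released code.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def retrieve_views(view_paths, prompt, top_k=5):
    """Return the top-k rendered views most similar to the text prompt."""
    text = clip.tokenize([prompt]).to(device)
    images = torch.stack(
        [preprocess(Image.open(p)) for p in view_paths]
    ).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(images)
        txt_feats = model.encode_text(text)
    # Normalize so the dot product is cosine similarity.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (img_feats @ txt_feats.T).squeeze(-1)
    top = scores.topk(min(top_k, len(view_paths)))
    return [(view_paths[i], scores[i].item()) for i in top.indices.tolist()]

# Usage (assumed file names): grounding clues, e.g. 2D boxes or masks from
# an open-vocabulary detector, would then be collected from these views and
# lifted back onto the 3D Gaussians -- that step is omitted here.
best_views = retrieve_views(
    ["view_000.png", "view_001.png", "view_002.png"],
    "the red chair by the window",
)
```

Because retrieval and clue collection operate entirely on rendered 2D views, nothing in this loop requires 3D labels or per-scene optimization, which is the property the abstract emphasizes.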
Similar Papers
Unified Representation Space for 3D Visual Grounding
CV and Pattern Recognition
Helps computers find objects in 3D using words.
MVGSR: Multi-View Consistency Gaussian Splatting for Robust Surface Reconstruction
CV and Pattern Recognition
Makes 3D models from moving pictures accurately.
ChangingGrounding: 3D Visual Grounding in Changing Scenes
CV and Pattern Recognition
Robots find things in changing rooms using memory.