Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
By: Liwei Liao, Xufeng Li, Xiaoyun Zheng, and more
Potential Business Impact:
Find objects in 3D worlds with just words.
3D Visual Grounding (3DVG) aims to locate objects in 3D scenes based on text prompts, which is essential for applications such as robotics. However, existing 3DVG methods face two main challenges: first, they struggle to handle the implicit representation of spatial textures in 3D Gaussian Splatting (3DGS), making per-scene training indispensable; second, they typically require large amounts of labeled data for effective training. To this end, we propose Grounding via View Retrieval (GVR), a novel zero-shot visual grounding framework for 3DGS that transforms 3DVG into a 2D retrieval task. GVR leverages object-level view retrieval to collect grounding clues from multiple views, which not only avoids the costly process of 3D annotation but also eliminates the need for per-scene training. Extensive experiments demonstrate that our method achieves state-of-the-art visual grounding performance while avoiding per-scene training, providing a solid foundation for zero-shot 3DVG research. Video demos can be found at https://github.com/leviome/GVR_demos.
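To make the retrieval step concrete, here is a minimal sketch of how "object-level view retrieval" could work: rendered views of a 3DGS scene are scored against the text prompt with an off-the-shelf vision-language model (CLIP here), and the top-scoring views are kept as sources of 2D grounding clues. This is an illustrative assumption, not the authors' exact pipeline; the model name, `top_k` parameter, and the final 2D-to-3D lifting step are all hypothetical choices.

```python
# Hypothetical sketch of view retrieval for zero-shot 3DVG, assuming
# CLIP as the scorer; the paper's actual components may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_views(prompt: str, views: list[Image.Image], top_k: int = 5) -> list[int]:
    """Rank rendered 3DGS views by CLIP similarity to a text prompt.

    Returns the indices of the top_k most relevant views. A 2D grounder
    (e.g., an open-vocabulary detector) would then localize the object in
    these views before lifting the result back onto the 3D Gaussians.
    """
    inputs = processor(text=[prompt], images=views,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_views, 1): similarity of each
    # rendered view to the single text prompt.
    scores = out.logits_per_image.squeeze(-1)
    return scores.topk(min(top_k, len(views))).indices.tolist()
```

Because the scorer and 2D grounder are both pretrained and the views are simply rendered from the 3DGS scene, a pipeline of this shape needs neither 3D annotations nor per-scene training, which is the core premise of the zero-shot setting described above.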
Similar Papers
Zero-Shot 3D Visual Grounding from Vision-Language Models
CV and Pattern Recognition
Finds objects in 3D using words, no special training.
View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs
CV and Pattern Recognition
Helps robots find objects using words.
Unified Representation Space for 3D Visual Grounding
CV and Pattern Recognition
Helps computers find objects in 3D using words.