OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
By: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, and more
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups corresponding to the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not rely on embeddings from a CLIP or BERT text encoder; instead, it performs text-to-text search directly with MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly on complex RES tasks. The code will be released.
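For illustration, here is a minimal, runnable Python sketch of the three stages the abstract describes: grouping voxels into objects, captioning each group, and answering an open-vocabulary query by text-to-text matching over the captions. Everything below (the names VoxelGroup, group_voxels, caption_groups, and text_to_text_search, the toy labels, and the token-overlap matcher) is a hypothetical stand-in, not the authors' implementation; a real system would derive groups from the SVR model and call an MLLM for the captioning and matching steps.

```python
"""Toy sketch of an OpenVoxel-style pipeline, under assumed interfaces:
(1) group sparse voxels into object-level sets, (2) caption each group,
(3) answer a query via text-to-text search over the captions (no
CLIP/BERT embeddings). Heuristics here only stand in for MLLM calls."""

from dataclasses import dataclass


@dataclass
class VoxelGroup:
    voxel_ids: list[int]   # indices into the sparse voxel set
    caption: str = ""      # object description (from an MLLM in practice)


def group_voxels(voxel_labels: list[str]) -> list[VoxelGroup]:
    """Toy grouping: one group per distinct label. OpenVoxel instead
    derives the groups from the SVR model itself, without training."""
    by_label: dict[str, list[int]] = {}
    for idx, label in enumerate(voxel_labels):
        by_label.setdefault(label, []).append(idx)
    return [VoxelGroup(ids) for ids in by_label.values()]


def caption_groups(groups: list[VoxelGroup], captioner) -> None:
    """Attach a caption to each group; `captioner` stands in for an MLLM
    describing a rendered view of the group's voxels."""
    for g in groups:
        g.caption = captioner(g)


def text_to_text_search(groups: list[VoxelGroup], query: str) -> VoxelGroup:
    """Crude token-overlap proxy for asking an MLLM which stored caption
    best matches the referring expression."""
    q = set(query.lower().split())
    return max(groups, key=lambda g: len(q & set(g.caption.lower().split())))


if __name__ == "__main__":
    # Toy scene: six voxels pre-labeled for demonstration only.
    labels = ["chair", "chair", "table", "table", "lamp", "lamp"]
    scene = group_voxels(labels)
    toy_descriptions = {0: "a red office chair",
                        1: "a wooden dining table",
                        2: "a small bedside lamp"}
    caption_groups(scene, lambda g: toy_descriptions[scene.index(g)])
    hit = text_to_text_search(scene, "the chair near the desk")
    print(hit.caption, hit.voxel_ids)   # -> a red office chair [0, 1]
```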
Similar Papers
OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation
Robotics
Robots understand and map places using words.
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
CV and Pattern Recognition
Finds objects in 3D rooms without human labels.
Native and Compact Structured Latents for 3D Generation
CV and Pattern Recognition
Creates more realistic 3D objects with complex shapes.