OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding
By: Sheng-Yu Huang, Jaesung Choe, Yu-Chiang Frank Wang, and more
We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups corresponding to the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not rely on embeddings from a CLIP or BERT text encoder; instead, it performs text-to-text search directly with MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly on complex RES tasks. The code will be released.
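For illustration, here is a minimal, runnable Python sketch of the three stages the abstract describes: grouping voxels into objects, captioning each group, and answering an open-vocabulary query by text-to-text matching over the captions. Everything below (the names VoxelGroup, group_voxels, caption_groups, and text_to_text_search, the toy labels, and the token-overlap matcher) is a hypothetical stand-in, not the authors' implementation; a real system would derive groups from the SVR model and call an MLLM for the captioning and matching steps.

```python
"""Toy sketch of an OpenVoxel-style pipeline, under assumed interfaces:
(1) group sparse voxels into object-level sets, (2) caption each group,
(3) answer a query via text-to-text search over the captions (no
CLIP/BERT embeddings). Heuristics here only stand in for MLLM calls."""

from dataclasses import dataclass


@dataclass
class VoxelGroup:
    voxel_ids: list[int]   # indices into the sparse voxel set
    caption: str = ""      # object description (from an MLLM in practice)


def group_voxels(voxel_labels: list[str]) -> list[VoxelGroup]:
    """Toy grouping: one group per distinct label. OpenVoxel instead
    derives the groups from the SVR model itself, without training."""
    by_label: dict[str, list[int]] = {}
    for idx, label in enumerate(voxel_labels):
        by_label.setdefault(label, []).append(idx)
    return [VoxelGroup(ids) for ids in by_label.values()]


def caption_groups(groups: list[VoxelGroup], captioner) -> None:
    """Attach a caption to each group; `captioner` stands in for an MLLM
    describing a rendered view of the group's voxels."""
    for g in groups:
        g.caption = captioner(g)


def text_to_text_search(groups: list[VoxelGroup], query: str) -> VoxelGroup:
    """Crude token-overlap proxy for asking an MLLM which stored caption
    best matches the referring expression."""
    q = set(query.lower().split())
    return max(groups, key=lambda g: len(q & set(g.caption.lower().split())))


if __name__ == "__main__":
    # Toy scene: six voxels pre-labeled for demonstration only.
    labels = ["chair", "chair", "table", "table", "lamp", "lamp"]
    scene = group_voxels(labels)
    toy_descriptions = {0: "a red office chair",
                        1: "a wooden dining table",
                        2: "a small bedside lamp"}
    caption_groups(scene, lambda g: toy_descriptions[scene.index(g)])
    hit = text_to_text_search(scene, "the chair near the desk")
    print(hit.caption, hit.voxel_ids)   # -> a red office chair [0, 1]
```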
Similar Papers
OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation
Robotics
Robots understand and map places using words.
OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations
CV and Pattern Recognition
Finds objects in 3D rooms without human labels.
Native and Compact Structured Latents for 3D Generation
CV and Pattern Recognition
Creates more realistic 3D objects with complex shapes.