OpenTrack3D: Towards Accurate and Generalizable Open-Vocabulary 3D Instance Segmentation

Published: December 3, 2025 | arXiv ID: 2512.03532v1

By: Zhishan Zhou, Siyuan Wei, Zengran Wang and more

BigTech Affiliations: ByteDance

Potential Business Impact:

Lets robots understand and find any object.

Business Areas:

Image Recognition Data and Analytics, Software

Generalizing open-vocabulary 3D instance segmentation (OV-3DIS) to diverse, unstructured, and mesh-free environments is crucial for robotics and AR/VR, yet remains a significant challenge. We attribute this to two key limitations of existing methods: (1) proposal generation relies on dataset-specific proposal networks or mesh-based superpoints, rendering these methods inapplicable in mesh-free scenarios and limiting generalization to novel scenes; and (2) CLIP-based classifiers have weak textual reasoning and struggle to recognize compositional and functional user queries. To address these issues, we introduce OpenTrack3D, a generalizable and accurate framework. Unlike methods that rely on pre-generated proposals, OpenTrack3D employs a novel visual-spatial tracker to construct cross-view consistent object proposals online. Given an RGB-D stream, our pipeline first leverages a 2D open-vocabulary segmenter to generate masks, which are lifted to 3D point clouds using depth. Mask-guided instance features are then extracted using DINO feature maps, and our tracker fuses visual and spatial cues to maintain instance consistency. The core pipeline is entirely mesh-free, yet we also provide an optional superpoint refinement module to further enhance performance when a scene mesh is available. Finally, we replace CLIP with a multi-modal large language model (MLLM), significantly enhancing compositional reasoning for complex user queries. Extensive experiments on diverse benchmarks, including ScanNet200, Replica, ScanNet++, and SceneFun3D, demonstrate state-of-the-art performance and strong generalization capabilities.
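
To make two steps of the abstract concrete, here is a minimal Python sketch of (a) lifting a 2D mask to 3D points with depth and (b) scoring whether two observations belong to the same instance by mixing a visual cue with a spatial cue. It is not the authors' implementation. The abstract does not specify the camera model, the spatial cue, or the fusion rule, so the pinhole intrinsics (fx, fy, cx, cy), the voxel-IoU overlap, the weight w_visual, and all function names below are assumptions made for illustration. The feature vectors stand in for mask-pooled DINO features, and the overlap step assumes both point sets are already expressed in a common world frame.

```python
import numpy as np


def lift_mask_to_points(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth to camera-frame 3D points.

    Assumes a pinhole camera; `mask` is a boolean HxW array and `depth`
    is an HxW array in metres with 0 marking missing depth.
    """
    v, u = np.nonzero(mask & (depth > 0))  # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # shape (N, 3)


def cosine_similarity(a, b):
    """Cosine similarity between two 1D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def voxel_iou(points_a, points_b, voxel=0.05):
    """Spatial cue (assumed): IoU of the voxel cells occupied by each point set."""
    cells_a = set(map(tuple, np.floor(points_a / voxel).astype(int).tolist()))
    cells_b = set(map(tuple, np.floor(points_b / voxel).astype(int).tolist()))
    union = len(cells_a | cells_b)
    return len(cells_a & cells_b) / union if union else 0.0


def fused_score(feat_a, pts_a, feat_b, pts_b, w_visual=0.5):
    """Hypothetical fusion rule: weighted sum of visual and spatial similarity."""
    visual = cosine_similarity(feat_a, feat_b)
    spatial = voxel_iou(pts_a, pts_b)
    return w_visual * visual + (1.0 - w_visual) * spatial
```

A tracker built on this sketch would compare each new observation against the existing instances and either merge it into the best match above some threshold or start a new instance. The threshold and the merge policy are further design choices that the abstract leaves open.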

Retrieving Objects from 3D Scenes with Box-Guided Open-Vocabulary Instance Segmentation

CV and Pattern Recognition

Finds rare objects in 3D scans using text.

22 Dec 2025

92%

Details Matter for Indoor Open-vocabulary 3D Instance Segmentation

CV and Pattern Recognition

Helps robots see and name objects in 3D.

30 Jul 2025

91%

OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

CV and Pattern Recognition

Finds objects in 3D rooms without human labels.

27 Aug 2025

Country of Origin

🇨🇳 China

Page Count

12 pages
