Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring

Published: November 24, 2025 | arXiv ID: 2511.18817v1

By: Siyuan Wei, Chunjie Wang, Xiao Liu and more

BigTech Affiliations: ByteDance

Potential Business Impact:

Makes 3D computer worlds talk and understand questions.

Business Areas:

Image Recognition, Data and Analytics, Software

3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.
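To make the idea of an "exclusive and compact" description in stage (3) concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the algorithm, and the attribute keys (`category`, `color`, `relation`), the function name `discriminative_description`, and the toy scene are all hypothetical. The sketch assumes each object is a flat dictionary of attributes and searches, smallest first, for an attribute subset that no distractor matches in full.

```python
from itertools import combinations


def discriminative_description(target, distractors, attribute_order=None):
    """Return the smallest attribute subset of `target` that excludes every distractor.

    Hypothetical helper for illustration only. A subset is "exclusive" when each
    distractor differs from the target on at least one attribute in the subset.
    Returns None when some distractor is indistinguishable from the target.
    """
    keys = list(attribute_order or target.keys())
    # Try subsets in order of increasing size so the first hit is the most compact.
    for size in range(1, len(keys) + 1):
        for subset in combinations(keys, size):
            excludes_all = all(
                any(d.get(k) != target[k] for k in subset) for d in distractors
            )
            if excludes_all:
                return {k: target[k] for k in subset}
    return None


# Toy scene (assumed data): the first object is the target, the rest are distractors.
scene = [
    {"category": "chair", "color": "red", "relation": "next to the desk"},
    {"category": "chair", "color": "blue", "relation": "next to the desk"},
    {"category": "chair", "color": "red", "relation": "by the window"},
    {"category": "table", "color": "red", "relation": "by the window"},
]
target, distractors = scene[0], scene[1:]
description = discriminative_description(target, distractors)
print(description)
```

In this toy scene no single attribute separates the target from all three distractors, so the search settles on a two-attribute description. An exhaustive subset search is exponential in the number of attributes; it is used here only because it makes the exclusivity and compactness criteria explicit.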

DenseAnnotate: Enabling Scalable Dense Caption Collection for Images and 3D Scenes via Spoken Descriptions

CV and Pattern Recognition

Lets computers understand images by listening.

16 Nov 2025 0

89%

Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation

CV and Pattern Recognition

Makes 3D pictures match words better.

18 Nov 2025 1

89%

HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model

CV and Pattern Recognition

Helps computers understand 3D spaces from pictures and words.

28 Nov 2025 0

Country of Origin

🇨🇳 China

Page Count

16 pages
