3DWG: 3D Weakly Supervised Visual Grounding via Category and Instance-Level Alignment

Published: May 3, 2025 | arXiv ID: 2505.01809v1

By: Xiaoqi Li, Jiaming Liu, Nuowei Han and more

Potential Business Impact:

Finds specific objects in 3D scans using words.

Business Areas:

Image Recognition, Data and Analytics, Software

The 3D weakly-supervised visual grounding task aims to localize oriented 3D boxes in point clouds based on natural language descriptions without requiring annotations to guide model learning. This setting presents two primary challenges: category-level ambiguity and instance-level complexity. Category-level ambiguity arises from representing objects of fine-grained categories in a highly sparse point cloud format, making category distinction challenging. Instance-level complexity stems from multiple instances of the same category coexisting in a scene, leading to distractions during grounding. To address these challenges, we propose a novel weakly-supervised grounding approach that explicitly differentiates between categories and instances. In the category-level branch, we utilize extensive category knowledge from a pre-trained external detector to align object proposal features with sentence-level category features, thereby enhancing category awareness. In the instance-level branch, we utilize spatial relationship descriptions from language queries to refine object proposal features, ensuring clear differentiation among objects. These designs enable our model to accurately identify target-category objects while distinguishing instances within the same category. Compared to previous methods, our approach achieves state-of-the-art performance on three widely used benchmarks: Nr3D, Sr3D, and ScanRef.
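The split between category-level and instance-level reasoning described above can be illustrated with a toy selection routine. This is only a minimal sketch under stated assumptions, not the authors' model or training procedure: the function names (`cosine`, `ground`), the `category_threshold` parameter, the two-dimensional features, and the precomputed `instance_scores` are all hypothetical stand-ins chosen for illustration. It shows one possible way a category filter followed by an instance-level score could pick a single proposal.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ground(proposal_feats, category_feat, instance_scores, category_threshold=0.5):
    """Return the index of the proposal selected by a two-stage toy scoring.

    proposal_feats:  one feature vector per object proposal (hypothetical).
    category_feat:   sentence-level category feature (hypothetical).
    instance_scores: per-proposal score for how well the proposal matches the
                     spatial-relation part of the query (hypothetical, precomputed).
    """
    # Category level: keep proposals whose features align with the category feature.
    cat_sims = [cosine(f, category_feat) for f in proposal_feats]
    candidates = [i for i, s in enumerate(cat_sims) if s >= category_threshold]
    if not candidates:
        # Fall back to the single best category match if nothing passes the threshold.
        candidates = [max(range(len(cat_sims)), key=lambda i: cat_sims[i])]
    # Instance level: among same-category candidates, pick the best spatial match.
    return max(candidates, key=lambda i: instance_scores[i])


# Two chair-like proposals and one table-like proposal (made-up features).
proposals = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
chair_category = [1.0, 0.0]
spatial_scores = [0.2, 0.8, 0.9]
print(ground(proposals, chair_category, spatial_scores))
```

In this toy input the table-like proposal has the highest spatial score, but it is removed by the category filter, so the choice is made only between the two chair-like proposals.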

CA-W3D: Leveraging Context-Aware Knowledge for Weakly Supervised Monocular 3D Detection

CV and Pattern Recognition

Helps cars see in 3D with less training.

6 Mar 2025 1

88%

DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding

CV and Pattern Recognition

Robots find objects using spoken words.

8 May 2025 1

88%

Grounding Beyond Detection: Enhancing Contextual Understanding in Embodied 3D Grounding

CV and Pattern Recognition

Helps robots find objects using words.

5 Jun 2025 2

Page Count

8 pages