
Dive3D: Diverse Distillation-based Text-to-3D Generation via Score Implicit Matching

Published: June 16, 2025 | arXiv ID: 2506.13594v1

By: Weimin Bai , Yubo Li , Wenzheng Chen and more

Potential Business Impact:

Creates more varied and realistic 3D objects from text.

Business Areas:

Diving Sports

Distilling pre-trained 2D diffusion models into 3D assets has driven remarkable advances in text-to-3D synthesis. However, existing methods typically rely on Score Distillation Sampling (SDS) loss, which involves asymmetric KL divergence, a formulation that inherently favors mode-seeking behavior and limits generation diversity. In this paper, we introduce Dive3D, a novel text-to-3D generation framework that replaces KL-based objectives with Score Implicit Matching (SIM) loss, a score-based objective that effectively mitigates mode collapse. Furthermore, Dive3D integrates both diffusion distillation and reward-guided optimization under a unified divergence perspective. Such reformulation, together with SIM loss, yields significantly more diverse 3D outputs while improving text alignment, human preference, and overall visual fidelity. We validate Dive3D across various 2D-to-3D prompts and find that it consistently outperforms prior methods in qualitative assessments, including diversity, photorealism, and aesthetic appeal. We further evaluate its performance on the GPTEval3D benchmark, comparing against nine state-of-the-art baselines. Dive3D also achieves strong results on quantitative metrics, including text-asset alignment, 3D plausibility, text-geometry consistency, texture quality, and geometric detail.
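
The abstract's claim that an asymmetric KL objective favors mode-seeking can be illustrated with a toy example. The sketch below is not from the paper and does not implement SDS or SIM; it assumes a hypothetical 1D two-mode target and two hand-picked single-Gaussian candidates, then compares the reverse KL (candidate to target) against the forward KL (target to candidate) on a discretized grid. All names and numbers in it are illustrative assumptions.

```python
import math

def gaussian(x, mean, std):
    """Density of a 1D normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def normalize(weights):
    """Turn non-negative grid weights into a discrete probability distribution."""
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """Discrete KL divergence KL(a || b); terms with a_i == 0 contribute nothing."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

# Grid on [-10, 10] with step 0.01 (an arbitrary choice for this toy example).
grid = [-10.0 + 0.01 * i for i in range(2001)]

# Hypothetical bimodal target: equal-weight modes at -3 and +3.
target = normalize(
    [0.5 * gaussian(x, -3.0, 0.5) + 0.5 * gaussian(x, 3.0, 0.5) for x in grid]
)

# Two candidate approximations: one locks onto a single mode, one spreads over both.
one_mode = normalize([gaussian(x, 3.0, 0.5) for x in grid])
both_modes = normalize([gaussian(x, 0.0, 3.0) for x in grid])

# Reverse KL, KL(candidate || target): the asymmetric direction the abstract refers to.
reverse_one, reverse_both = kl(one_mode, target), kl(both_modes, target)
# Forward KL, KL(target || candidate), shown for contrast.
forward_one, forward_both = kl(target, one_mode), kl(target, both_modes)

print(f"reverse KL: one mode {reverse_one:.3f}, both modes {reverse_both:.3f}")
print(f"forward KL: one mode {forward_one:.3f}, both modes {forward_both:.3f}")
```

In this toy setting the reverse KL scores the single-mode candidate better than the one covering both modes, while the forward KL ranks them the other way round. That is the sense in which a reverse-KL objective can trade diversity for sharpness; how SIM avoids this is specified in the paper itself, not in this sketch.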

Text-to-3D Generation using Jensen-Shannon Score Distillation

CV and Pattern Recognition

Creates better 3D pictures from words.

8 Mar 2025

90%

Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation

CV and Pattern Recognition

Creates realistic 3D objects from text descriptions.

19 Sep 2025

89%

CoherenDream: Boosting Holistic Text Coherence in 3D Generation via Multimodal Large Language Models Feedback

CV and Pattern Recognition

Makes 3D pictures match words better.

28 Apr 2025


Country of Origin

🇨🇳 China

Page Count

18 pages
