Score: 2

DegDiT: Controllable Audio Generation with Dynamic Event Graph Guided Diffusion Transformer

Published: August 19, 2025 | arXiv ID: 2508.13786v1

By: Yisu Liu, Chenxing Li, Wanqian Zhang and more

BigTech Affiliations: Tencent

Potential Business Impact:

Makes sounds match words exactly, even timing.

Business Areas:

Digital Media, Media and Entertainment

Controllable text-to-audio generation aims to synthesize audio from textual descriptions while satisfying user-specified constraints, including event types, temporal sequences, and onset and offset timestamps. This enables precise control over both the content and temporal structure of the generated audio. Despite recent progress, existing methods still face inherent trade-offs among accurate temporal localization, open-vocabulary scalability, and practical efficiency. To address these challenges, we propose DegDiT, a novel dynamic event graph-guided diffusion transformer framework for open-vocabulary controllable audio generation. DegDiT encodes the events in the description as structured dynamic graphs. The nodes in each graph are designed to represent three aspects: semantic features, temporal attributes, and inter-event connections. A graph transformer is employed to integrate these nodes and produce contextualized event embeddings that serve as guidance for the diffusion model. To ensure high-quality and diverse training data, we introduce a quality-balanced data selection pipeline that combines hierarchical event annotation with multi-criteria quality scoring, resulting in a curated dataset with semantic diversity. Furthermore, we present consensus preference optimization, which guides audio generation through consensus among multiple reward signals. Extensive experiments on the AudioCondition, DESED, and AudioTime datasets demonstrate that DegDiT achieves state-of-the-art performance across a variety of objective and subjective evaluation metrics.
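To make the idea of an event graph concrete, here is a minimal sketch of how timed events might be stored as nodes with temporal attributes and linked by inter-event relations. It is not the DegDiT implementation: the `Event` class, the `temporal_relation` and `build_event_graph` helpers, the three relation names, and the example events are all assumptions made for illustration, and the semantic features and graph transformer described in the abstract are left out.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Event:
    """One sound event with its timing (hypothetical structure, not from the paper)."""
    label: str     # event type, e.g. "dog barking"
    onset: float   # start time in seconds
    offset: float  # end time in seconds


def temporal_relation(a: Event, b: Event) -> str:
    """Classify how event a relates to event b in time (assumed relation set)."""
    if a.offset <= b.onset:
        return "before"
    if b.offset <= a.onset:
        return "after"
    return "overlaps"


def build_event_graph(events):
    """Return (nodes, edges); edges map an index pair (i, j) to a temporal relation."""
    nodes = list(events)
    edges = {
        (i, j): temporal_relation(nodes[i], nodes[j])
        for i, j in combinations(range(len(nodes)), 2)
    }
    return nodes, edges


# Illustrative events; labels and timestamps are made up for this example.
events = [
    Event("dog barking", 0.0, 2.0),
    Event("car horn", 1.5, 3.0),
    Event("door slam", 4.0, 4.5),
]
nodes, edges = build_event_graph(events)
for (i, j), relation in edges.items():
    print(f"{nodes[i].label} {relation} {nodes[j].label}")
```

In a full system, each node would also carry a learned semantic embedding of its label, and the edges would inform a graph model rather than being printed.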

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Sound

Makes computers talk with perfect timing and clarity.

10 Oct 2025 0

88%

SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation

CV and Pattern Recognition

Makes videos match your exact descriptions.

23 Aug 2025 0

88%

DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers

Audio and Speech Processing

Cleans up noisy and echoey voices perfectly.

13 Apr 2025 1


Country of Origin

🇨🇳 China

Page Count

10 pages


