
DMP-TTS: Disentangled multi-modal Prompting for Controllable Text-to-Speech with Chained Guidance

Published: December 10, 2025 | arXiv ID: 2512.09504v1

By: Kang Yin, Chunyu Qiang, Sirui Zhao and more

Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
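To make the chained guidance idea concrete, here is a minimal sketch of one plausible way to realize it. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the nested-difference form of the guidance rule, the function names `hierarchical_dropout` and `chained_cfg`, and the dropout probabilities. Predictions are plain lists of floats standing in for the DiT outputs.

```python
import random


def hierarchical_dropout(content, timbre, style, rng,
                         p_content=0.1, p_timbre=0.1, p_style=0.1):
    """Drop conditions in a nested order (hypothetical training-time helper).

    Dropping a condition also drops every condition after it in the chain
    content -> timbre -> style, so the model only ever sees the prefixes
    (), (content), (content, timbre), (content, timbre, style).
    Probabilities are illustrative, not taken from the paper.
    """
    u = rng.random()
    if u < p_content:
        return None, None, None          # fully unconditional
    if u < p_content + p_timbre:
        return content, None, None       # content only
    if u < p_content + p_timbre + p_style:
        return content, timbre, None     # content + timbre
    return content, timbre, style        # all conditions kept


def chained_cfg(pred_uncond, pred_c, pred_ct, pred_cts,
                w_content, w_timbre, w_style):
    """Combine four predictions with one guidance weight per condition.

    Assumed rule: each weight scales the difference contributed by adding
    one more condition to the chain. With all weights equal to 1 the
    result is exactly the fully conditioned prediction.
    """
    return [
        u + w_content * (c - u) + w_timbre * (ct - c) + w_style * (cts - ct)
        for u, c, ct, cts in zip(pred_uncond, pred_c, pred_ct, pred_cts)
    ]


if __name__ == "__main__":
    # Toy predictions standing in for model outputs at one denoising step.
    uncond = [0.0, 0.0]
    c = [1.0, 2.0]
    ct = [1.5, 2.5]
    cts = [2.0, 4.0]
    # Strengthen style guidance only, leaving content and timbre at 1.
    print(chained_cfg(uncond, c, ct, cts, 1.0, 1.0, 2.0))
```

In this form, raising `w_style` amplifies only the change attributable to the style prompt, which is the kind of independent adjustment the abstract describes. The real system may order or combine the terms differently.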

ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling

Sound

Makes computers talk with perfect timing and clarity.

10 Oct 2025 0

89%

Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement

Sound

Makes computer voices sound more real and emotional.

2 Oct 2025 0

89%

M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

Sound

Makes computers speak like real people, faster.

4 Dec 2025 1

