DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
By: Zongcai Du, Guilin Deng, Xiaofeng Guo, and more
Potential Business Impact:
Makes AI sing songs with real-sounding voices.
Recent diffusion-based Singing Voice Synthesis (SVS) systems demonstrate strong expressiveness but remain limited by data scarcity and model scalability. We introduce a two-stage pipeline: first, a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics; then, melody-specific models trained on this seed set synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates the need for phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
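To picture the implicit alignment idea, the sketch below builds a boolean cross-attention mask that confines each phoneme query to the acoustic frames of its parent character, and applies it inside a single-head attention with qk-norm on the queries and keys. This is a minimal PyTorch illustration under stated assumptions, not the paper's implementation: the function names (`char_span_mask`, `masked_cross_attention`), the fixed temperature, the single-head setup, and the toy spans are all hypothetical; only the character-span masking and qk-norm ideas come from the abstract.

```python
import torch
import torch.nn.functional as F

def char_span_mask(phone_to_char, char_spans, num_frames):
    """Boolean mask [P, T]: phoneme p may attend to acoustic frame t only
    if t lies inside the frame span of p's parent character.

    phone_to_char: LongTensor [P], character index of each phoneme.
    char_spans:    LongTensor [C, 2], (start, end) frame span per character,
                   end-exclusive (assumed available at character level,
                   e.g. from a coarse ASR alignment).
    """
    starts = char_spans[phone_to_char, 0]        # [P] span start per phoneme
    ends = char_spans[phone_to_char, 1]          # [P] span end per phoneme
    t = torch.arange(num_frames)                 # [T] frame indices
    return (t[None, :] >= starts[:, None]) & (t[None, :] < ends[:, None])

def masked_cross_attention(q, k, v, mask, temperature=0.07):
    """Single-head cross-attention with qk-norm and a boolean keep-mask.
    q: [P, D] phoneme queries; k, v: [T, D] acoustic keys/values."""
    # qk-norm: L2-normalize queries and keys so logits stay bounded at scale;
    # the fixed temperature stands in for a learnable scale.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    scores = (q @ k.T) / temperature             # [P, T] attention logits
    scores = scores.masked_fill(~mask, float("-inf"))  # block out-of-span frames
    return torch.softmax(scores, dim=-1) @ v     # [P, D]

# Toy example: two characters, five phonemes, 100 acoustic frames.
phone_to_char = torch.tensor([0, 0, 1, 1, 1])    # e.g. n,i -> char 0; h,a,o -> char 1
char_spans = torch.tensor([[0, 40], [40, 100]])  # per-character frame spans
mask = char_span_mask(phone_to_char, char_spans, num_frames=100)
out = masked_cross_attention(torch.randn(5, 64), torch.randn(100, 64),
                             torch.randn(100, 64), mask)   # -> [5, 64]
```

Because each phoneme row of the mask is guaranteed nonempty, the masked softmax stays well defined even when the exact phoneme boundaries inside a character are unknown or noisy, which is the robustness the abstract attributes to the mechanism.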
Similar Papers
YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance
Sound
Makes computers sing any song with any words.
CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance
Sound
Makes computer singing sound more like real people.
DiTSE: High-Fidelity Generative Speech Enhancement via Latent Diffusion Transformers
Audio and Speech Processing
Cleans up noisy, echoey voice recordings.