ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
By: Han Zhu , Wei Kang , Zengwei Yao and more
Potential Business Impact:
Makes computer voices sound real, much faster.
Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference speeds due to massive parameters. To address this issue, this paper introduces ZipVoice, a high-quality flow-matching-based zero-shot TTS model with a compact model size and fast inference speed. Key designs include: 1) a Zipformer-based vector field estimator to maintain adequate modeling capabilities under constrained size; 2) Average upsampling-based initial speech-text alignment and Zipformer-based text encoder to improve speech intelligibility; 3) A flow distillation method to reduce sampling steps and eliminate the inference overhead associated with classifier-free guidance. Experiments on 100k hours multilingual datasets show that ZipVoice matches state-of-the-art models in speech quality, while being 3 times smaller and up to 30 times faster than a DiT-based flow-matching baseline. Codes, model checkpoints and demo samples are publicly available at https://github.com/k2-fsa/ZipVoice.
Similar Papers
ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Audio and Speech Processing
Makes talking computers sound like real people.
Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech
Sound
Makes computers talk like any person.
DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Sound
Makes voices sound like anyone, super fast.