VibeVoice Technical Report
By: Zhiliang Peng, Jianwei Yu, Wenhui Wang, and more
Potential Business Impact:
Creates long, natural-sounding conversations with many voices.
This report presents VibeVoice, a novel model designed to synthesize long-form, multi-speaker speech via next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors through a diffusion process. To enable this, we introduce a novel continuous speech tokenizer that, compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. As a result, VibeVoice can synthesize up to 90 minutes of long-form speech (within a 64K context window) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
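The next-token diffusion loop described above can be sketched as follows. This is a minimal toy illustration, not the report's implementation: the autoregressive backbone is stood in by a simple history summary, the diffusion head by a toy iterative denoiser, and `LATENT_DIM` and `STEPS` are hypothetical values chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 8   # size of each continuous speech latent (hypothetical)
STEPS = 4        # denoising steps per token (hypothetical)

def ar_backbone(history):
    # Stand-in for the autoregressive LM: summarizes past latents
    # into a conditioning vector. The real model is a Transformer.
    if not history:
        return np.zeros(LATENT_DIM)
    return np.tanh(np.mean(history, axis=0))

def diffusion_head(cond, steps=STEPS):
    # Stand-in for the diffusion head: iteratively refines a Gaussian
    # sample toward a latent conditioned on `cond`.
    x = rng.standard_normal(LATENT_DIM)
    for t in range(steps, 0, -1):
        # Toy "denoising" update pulling the sample toward the condition.
        x = x + (cond - x) / t
    return x

def generate(num_tokens=5):
    # Autoregressive outer loop: each continuous token is produced by
    # the diffusion head, then fed back as context for the next one.
    history = []
    for _ in range(num_tokens):
        cond = ar_backbone(history)
        latent = diffusion_head(cond)
        history.append(latent)
    return np.stack(history)

latents = generate()
print(latents.shape)  # (5, 8)
```

In the actual system, each latent would be decoded back to audio by the continuous speech tokenizer's decoder; the sketch only shows the generation loop's structure.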
Similar Papers
VIBE: Video-Input Brain Encoder for fMRI Response Modeling
Machine Learning (CS)
Reads minds by watching movies and listening.
Continuous-Token Diffusion for Speaker-Referenced TTS in Multimodal LLMs
Audio and Speech Processing
Makes computers speak more like real people.