Score: 4

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

Published: June 1, 2025 | arXiv ID: 2506.00885v1

By: Leying Zhang, Yao Qian, Xiaofei Wang and more

BigTech Affiliations: Microsoft

Potential Business Impact:

Makes computers create realistic talking conversations.

Business Areas:

Speech Recognition, Data and Analytics, Software

Generating natural-sounding, multi-speaker dialogue is crucial for applications such as podcast creation, virtual agents, and multimedia content generation. However, existing systems struggle to maintain speaker consistency, model overlapping speech, and synthesize coherent conversations efficiently. In this paper, we introduce CoVoMix2, a fully non-autoregressive framework for zero-shot multi-talker dialogue generation. CoVoMix2 directly predicts mel-spectrograms from multi-stream transcriptions using a flow-matching-based generative model, eliminating the reliance on intermediate token representations. To better capture realistic conversational dynamics, we propose transcription-level speaker disentanglement, sentence-level alignment, and prompt-level random masking strategies. Our approach achieves state-of-the-art performance, outperforming strong baselines like MoonCast and Sesame in speech quality, speaker consistency, and inference speed. Notably, CoVoMix2 operates without requiring transcriptions for the prompt and supports controllable dialogue generation, including overlapping speech and precise timing control, demonstrating strong generalizability to real-world speech generation scenarios.
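The flow-matching idea mentioned in the abstract can be illustrated with a toy sketch. This is not the CoVoMix2 implementation: the straight-line (linear interpolation) probability path, the mel-spectrogram shape of 80 frames by 10 bins, and every function name below are assumptions chosen for illustration. The sketch shows how a training pair is built and how a sample is drawn by integrating a velocity field with Euler steps; a closed-form "oracle" field stands in for the trained network so the example runs on its own.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one flow-matching training example for a target x1 (hypothetical helper)."""
    x0 = rng.standard_normal(x1.shape)   # noise sample at t = 0
    t = rng.uniform()                    # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path from x0 to x1
    target_velocity = x1 - x0            # what a network would be trained to regress
    return xt, t, target_velocity

def euler_sample(velocity_fn, shape, num_steps, rng):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = rng.standard_normal(shape)       # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x

def oracle_velocity(target):
    """Stand-in for a trained model: the exact field that carries any x to target."""
    def v(x, t):
        return (target - x) / (1.0 - t)  # t never reaches 1 inside euler_sample
    return v

rng = np.random.default_rng(0)
fake_mel = np.ones((80, 10))             # assumed toy "mel-spectrogram"
sample = euler_sample(oracle_velocity(fake_mel), fake_mel.shape, num_steps=8, rng=rng)
```

In a real system the oracle would be replaced by a neural network conditioned on the transcriptions and speaker prompt. Because all frames are updated together at each step, the number of network evaluations depends on the step count rather than the utterance length, which is why non-autoregressive sampling of this kind can be fast.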

MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation

Audio and Speech Processing

Makes talking videos sound and look like the original speaker.

14 Mar 2025 1

88%

Towards Better Disentanglement in Non-Autoregressive Zero-Shot Expressive Voice Conversion

Sound

Changes voice to sound like someone else.

4 Jun 2025 1

87%

StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling

Sound

Makes computers talk like real people instantly.

14 Jun 2025 1


Country of Origin

🇨🇳 🇺🇸 China, United States

Repos / Data Links

github.com github.com

Page Count

16 pages
