MoonCast: High-Quality Zero-Shot Podcast Generation

Published: March 18, 2025 | arXiv ID: 2503.14345v2

By: Zeqian Ju, Dongchao Yang, Jianwei Yu, and more

Potential Business Impact:

Creates realistic podcast voices from any text.

Business Areas:

Podcast Media and Entertainment, Music and Audio

Recent advances in text-to-speech synthesis have achieved notable success in generating high-quality short utterances for individual speakers. However, these systems still face challenges when extending their capabilities to long, multi-speaker, and spontaneous dialogues, typical of real-world scenarios such as podcasts. These limitations arise from two primary challenges: 1) long speech: podcasts typically span several minutes, exceeding the upper limit of most existing work; 2) spontaneity: podcasts are marked by their spontaneous, oral nature, which sharply contrasts with formal, written contexts; existing works often fall short in capturing this spontaneity. In this paper, we propose MoonCast, a solution for high-quality zero-shot podcast generation, aiming to synthesize natural podcast-style speech from text-only sources (e.g., stories, technical reports, news in TXT, PDF, or Web URL formats) using the voices of unseen speakers. To generate long audio, we adopt a long-context language model-based audio modeling approach utilizing large-scale long-context speech data. To enhance spontaneity, we utilize a podcast generation module to generate scripts with spontaneous details, which have been empirically shown to be as crucial as the text-to-speech modeling itself. Experiments demonstrate that MoonCast outperforms baselines, with particularly notable improvements in spontaneity and coherence.

SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

Audio and Speech Processing

Makes fake voices have real conversations.

27 Oct 2025 2

88%

Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Sound

Makes computers talk like any person.

3 Oct 2025 1

Page Count

24 pages
