
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

Published: November 15, 2025 | arXiv ID: 2511.12347v1

By: Zhisheng Zheng, Puyuan Peng, Anuj Diwan and more

Potential Business Impact:

Lets a cloned voice speak, or have its recordings edited, in any of 11 supported languages.

Business Areas:

Speech Recognition, Data and Analytics, Software

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
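The abstract does not spell out the token reordering mechanism. The sketch below shows one common way to turn infilling into left-to-right generation: the span to be generated is moved to the end of the sequence, so editing and zero-shot TTS both reduce to "continue the sequence". The names `reorder`, `restore`, and `MASK` are hypothetical, the tokens are placeholder strings rather than real codec or text tokens, and this is an illustration under those assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

MASK = "<MASK>"  # hypothetical marker for where the moved span used to be


def reorder(tokens: Sequence[str], start: int, end: int) -> Tuple[List[str], List[str]]:
    """Split tokens into (context, target), with the span [start, end) moved last."""
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span must satisfy 0 <= start <= end <= len(tokens)")
    context = list(tokens[:start]) + [MASK] + list(tokens[end:])
    target = list(tokens[start:end])
    return context, target


def restore(context: Sequence[str], generated: Sequence[str]) -> List[str]:
    """Put a generated span back where the MASK marker sits."""
    i = list(context).index(MASK)
    return list(context[:i]) + list(generated) + list(context[i + 1:])


# Editing: regenerate the middle of an existing recording.
speech = ["s0", "s1", "s2", "s3", "s4"]
edit_context, edit_target = reorder(speech, 2, 4)

# Zero-shot TTS: the span to generate follows the voice prompt, so the suffix is empty.
prompt = ["p0", "p1"]
tts_context, tts_target = reorder(prompt, len(prompt), len(prompt))
```

In this toy setup a model would be conditioned on `context` and trained to emit `target`; `restore` then splices the generated span back into its original position. Applying the same split to time-aligned text and speech tokens is an assumption about how the two streams would be kept in step.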

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

CV and Pattern Recognition

Makes videos speak with matching faces.

3 Apr 2025


Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

Sound

Makes voices sound like anyone, in any language.

18 Sep 2025



Page Count

20 pages
