Score: 2

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

Published: March 3, 2025 | arXiv ID: 2503.01710v1

By: Xinsheng Wang, Mingqi Jiang, Ziyang Ma, and more

Potential Business Impact:

Lets software speak in any voice and style: it can clone a voice from a short reference sample or create entirely new voices from attribute controls such as gender, pitch, and speaking rate.

Business Areas:
Text Analytics, Data and Analytics, Software

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis. However, existing foundation models rely on multi-stage processing or complex architectures for predicting multiple codebooks, limiting efficiency and integration flexibility. To overcome these challenges, we introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content and fixed-length global tokens for speaker attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained control (e.g., gender, speaking style) and fine-grained adjustments (e.g., precise pitch values, speaking rate). To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations. Extensive experiments demonstrate that Spark-TTS not only achieves state-of-the-art zero-shot voice cloning but also generates highly customizable voices that surpass the limitations of reference-based synthesis. Source code, pre-trained models, and audio samples are available at https://github.com/SparkAudio/Spark-TTS.
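To make the decoupled-token pipeline in the abstract concrete, below is a minimal, illustrative Python sketch of the described flow: a reference clip is encoded into fixed-length global (speaker) tokens plus low-bitrate semantic (content) tokens, an LLM conditioned on the input text and the global tokens predicts new semantic tokens, and the codec decoder turns the combined tokens back into audio. Every class, function, and signature here is a hypothetical placeholder for exposition, not the actual Spark-TTS or BiCodec API; consult the linked repository for the real interface.

```python
# Illustrative sketch of the decoupled single-stream TTS flow described in the
# abstract. All names (BiCodecTokens, encode_reference, llm_generate_semantic,
# decode_to_waveform) are hypothetical placeholders, not the Spark-TTS API.

from dataclasses import dataclass
from typing import List


@dataclass
class BiCodecTokens:
    """Decoupled single-stream representation assumed by this sketch."""
    global_tokens: List[int]    # fixed-length tokens for speaker attributes
    semantic_tokens: List[int]  # low-bitrate tokens for linguistic content


def encode_reference(audio_path: str) -> BiCodecTokens:
    """Placeholder codec encoder: reference audio -> decoupled tokens."""
    # A real encoder would produce fixed-length global tokens (speaker timbre)
    # and variable-length semantic tokens; these values are dummies.
    return BiCodecTokens(global_tokens=[7, 42, 3, 19],
                         semantic_tokens=[5, 5, 12, 8, 1])


def llm_generate_semantic(text: str, global_tokens: List[int]) -> List[int]:
    """Placeholder for the LLM step: condition on text plus global (speaker)
    tokens and autoregressively predict semantic tokens for the utterance."""
    # Toy deterministic mapping standing in for Qwen2.5 decoding.
    return [(ord(c) + sum(global_tokens)) % 256 for c in text if not c.isspace()]


def decode_to_waveform(tokens: BiCodecTokens) -> List[float]:
    """Placeholder codec decoder: tokens -> waveform samples."""
    return [t / 255.0 for t in tokens.semantic_tokens]


if __name__ == "__main__":
    # Zero-shot voice cloning flow: speaker identity comes from the reference
    # clip's global tokens, content comes from new text.
    ref = encode_reference("reference_speaker.wav")
    semantic = llm_generate_semantic("Hello from Spark-TTS.", ref.global_tokens)
    waveform = decode_to_waveform(BiCodecTokens(ref.global_tokens, semantic))
    print(f"{len(waveform)} synthetic samples generated")
```

The point of the sketch is the separation of concerns the paper claims: because speaker identity lives only in the fixed-length global tokens, the same LLM decoding step can be driven either by tokens from a reference clip (voice cloning) or by tokens generated from attribute labels (controllable synthesis).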

Country of Origin
🇭🇰 Hong Kong


Page Count
22 pages

Category
Computer Science: Sound