Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
By: Junjie Cao, Yichen Han, Ruonan Zhang, and more
Potential Business Impact:
Makes computer voices sound more natural and human.
Existing Large Language Model (LLM) based autoregressive (AR) text-to-speech (TTS) systems, while achieving state-of-the-art quality, still face critical challenges. The foundation of this LLM-based paradigm is the discretization of the continuous speech waveform into a sequence of discrete tokens by a neural audio codec. However, single-codebook modeling, while well suited to text LLMs, suffers from significant information loss, and hierarchical acoustic tokens, typically generated via Residual Vector Quantization (RVQ), often lack explicit semantic structure, placing a heavy learning burden on the model. Furthermore, the autoregressive process is inherently susceptible to error accumulation, which can degrade generation stability. To address these limitations, we propose CaT-TTS, a novel framework for robust and semantically grounded zero-shot synthesis. First, we introduce S3Codec, a split RVQ codec that injects explicit linguistic features into its primary codebook via semantic distillation from a state-of-the-art ASR model, providing a structured representation that simplifies the learning task. Second, we propose an "Understand-then-Generate" dual-Transformer architecture that decouples comprehension from rendering: an initial "Understanding" Transformer models the cross-modal relationship between the text and the audio's semantic tokens to form a high-level utterance plan, and a subsequent "Generation" Transformer executes this plan by autoregressively synthesizing hierarchical acoustic tokens. Finally, to enhance generation stability, we introduce Masked Audio Parallel Inference (MAPI), a nearly parameter-free inference strategy that dynamically guides the decoding process to mitigate local errors.
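As a rough illustration of the "Understand-then-Generate" split described in the abstract, the sketch below wires a small "Understanding" encoder and a "Generation" decoder together in PyTorch. All module names, dimensions, and interfaces (UnderstandingTransformer, GenerationTransformer, the vocabulary sizes, and the cross-attention conditioning) are assumptions made for illustration only, not the authors' implementation.

```python
# Minimal sketch of a two-stage "comprehend, then talk" pipeline.
# All shapes, vocab sizes, and module names are illustrative assumptions.
import torch
import torch.nn as nn


class UnderstandingTransformer(nn.Module):
    """Jointly encodes text tokens and the audio's semantic tokens into a
    sequence of hidden states standing in for the 'utterance plan'."""

    def __init__(self, text_vocab, semantic_vocab, d_model=256, n_layers=2):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.sem_emb = nn.Embedding(semantic_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, text_ids, semantic_ids):
        # Concatenate text and semantic-token embeddings into one sequence.
        x = torch.cat([self.text_emb(text_ids), self.sem_emb(semantic_ids)], dim=1)
        return self.encoder(x)  # (B, T_text + T_sem, d_model): the "plan"


class GenerationTransformer(nn.Module):
    """Autoregressive decoder that predicts acoustic tokens conditioned on the
    plan via cross-attention (a simplification of hierarchical RVQ decoding)."""

    def __init__(self, acoustic_vocab, d_model=256, n_layers=2):
        super().__init__()
        self.ac_emb = nn.Embedding(acoustic_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, acoustic_vocab)

    def forward(self, acoustic_ids, plan):
        tgt = self.ac_emb(acoustic_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)
        h = self.decoder(tgt, plan, tgt_mask=causal)
        return self.head(h)  # next-token logits over acoustic codes


if __name__ == "__main__":
    B, T_text, T_sem, T_ac = 2, 12, 30, 30
    understand = UnderstandingTransformer(text_vocab=100, semantic_vocab=512)
    generate = GenerationTransformer(acoustic_vocab=1024)

    text = torch.randint(0, 100, (B, T_text))
    semantic = torch.randint(0, 512, (B, T_sem))
    acoustic = torch.randint(0, 1024, (B, T_ac))

    plan = understand(text, semantic)   # "comprehend"
    logits = generate(acoustic, plan)   # "talk": (B, T_ac, 1024)
    print(logits.shape)
```

In the paper's framing, the plan would be built from S3Codec's semantically distilled primary codebook, the decoder would emit the remaining hierarchical RVQ levels, and MAPI would guide decoding at inference time; the sketch only captures the two-stage decoupling of comprehension from rendering.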
Similar Papers
From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training
Computation and Language
Makes computers talk like people, understanding words and sounds.
UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
Audio and Speech Processing
Lets computers understand and speak like people.
Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech
Audio and Speech Processing
Makes computer voices sound more human and real.