From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

Published: September 24, 2025 | arXiv ID: 2509.20072v2

By: Tianqiao Liu, Xueyi Li, Hao Wang and more

Potential Business Impact:

Makes computers talk like people, understanding words and sounds.

Business Areas:

Text Analytics Data and Analytics, Software

Recent advances in large language models (LLMs) have attracted significant interest in extending their capabilities to multimodal scenarios, particularly for speech-to-speech conversational systems. However, existing multimodal models handling interleaved audio and text rely on autoregressive methods, overlooking that text depends on target-target relations whereas audio depends mainly on source-target relations. In this work, we propose Text-to-Talk (TtT), a unified audio-text framework that integrates autoregressive (AR) text generation with non-autoregressive (NAR) audio diffusion in a single Transformer. By leveraging the any-order autoregressive property of absorbing discrete diffusion, our approach provides a unified training objective for text and audio. To support this hybrid generation paradigm, we design a modality-aware attention mechanism that enforces causal decoding for text while allowing bidirectional modeling within audio spans, and further introduce three training strategies that reduce train-test discrepancies. During inference, TtT employs block-wise diffusion to synthesize audio in parallel while flexibly handling variable-length outputs. Extensive experiments across Audio-QA and ASR tasks demonstrate the effectiveness of our approach, with detailed ablation studies validating each proposed component. We will open-source our models, data and code to facilitate future research in this direction.
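The modality-aware attention described in the abstract (causal decoding for text, bidirectional modeling within audio spans) can be illustrated with a small mask builder. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `modality_aware_mask` and the `span_ids` encoding (-1 for a text token, a non-negative integer naming the audio span otherwise) are hypothetical, and the abstract does not specify how the actual mask treats positions outside an audio span.

```python
def modality_aware_mask(span_ids):
    """Build a boolean attention mask; mask[i][j] is True if position i may attend to j.

    span_ids: one integer per position (hypothetical encoding).
      -1   -> text token
      >= 0 -> audio token; the value identifies its audio span
    """
    n = len(span_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Every position sees itself and everything before it.
            causal = j <= i
            # Audio tokens additionally see later tokens of their own span.
            same_audio_span = span_ids[i] >= 0 and span_ids[i] == span_ids[j]
            mask[i][j] = causal or same_audio_span
    return mask


# Two text tokens, one three-token audio span, then one more text token.
mask = modality_aware_mask([-1, -1, 0, 0, 0, -1])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# prints:
# 100000
# 110000
# 111110
# 111110
# 111110
# 111111
```

In this sketch, text rows stay lower-triangular, while the three audio rows share one fully visible block, which is what lets audio tokens in a span be denoised in parallel rather than strictly left to right.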

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

Sound

Makes computer voices sound more natural and human.

26 Sep 2025

Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Audio and Speech Processing

Makes computer voices sound more human and real.

6 Aug 2025

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Audio and Speech Processing

Makes computers talk like people faster.

14 Apr 2025

Page Count

23 pages
