
Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers

Published: January 15, 2026 | arXiv ID: 2601.10770v1

By: Runyuan Cai, Yu Lin, Yiming Wang and more

Potential Business Impact:

One AI understands and makes all speech sounds.

Business Areas:

Speech Recognition, Data and Analytics, Software

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
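
The abstract does not spell out the token layout or the instruction format, so the following is only a minimal sketch of how instruction-driven task induction over a shared discrete token space could look. The vocabulary sizes, the task tags ("asr", "tts", "vc"), the separator token, and the function names are all hypothetical and are not taken from the paper. The idea illustrated is that text ids and audio codec ids live in one id space, and a leading task tag tells a single autoregressive model what kind of tokens to generate next.

```python
from typing import List, Optional

# Hypothetical sizes, chosen only for illustration (not from the paper).
TEXT_VOCAB_SIZE = 1000
AUDIO_VOCAB_SIZE = 512
TASK_TAGS = ("asr", "tts", "vc")
SEP = "sep"
_SPECIALS = TASK_TAGS + (SEP,)


def text_token(i: int) -> int:
    """Text ids occupy the first block of the shared vocabulary."""
    if not 0 <= i < TEXT_VOCAB_SIZE:
        raise ValueError(f"text id out of range: {i}")
    return i


def audio_token(code: int) -> int:
    """Audio codec ids are offset so they never collide with text ids."""
    if not 0 <= code < AUDIO_VOCAB_SIZE:
        raise ValueError(f"audio code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def special_token(name: str) -> int:
    """Task tags and the separator sit after the audio block."""
    return TEXT_VOCAB_SIZE + AUDIO_VOCAB_SIZE + _SPECIALS.index(name)


def build_prompt(
    task: str,
    text_ids: Optional[List[int]] = None,
    audio_codes: Optional[List[int]] = None,
    reference_codes: Optional[List[int]] = None,
) -> List[int]:
    """Build one flat token sequence; the model would continue after the final separator."""
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task: {task}")
    prompt = [special_token(task)]
    if task == "tts":
        if text_ids is None:
            raise ValueError("tts needs text_ids")
        prompt += [text_token(i) for i in text_ids]
    elif task == "asr":
        if audio_codes is None:
            raise ValueError("asr needs audio_codes")
        prompt += [audio_token(c) for c in audio_codes]
    else:  # vc: source speech, separator, then reference speaker audio
        if audio_codes is None or reference_codes is None:
            raise ValueError("vc needs audio_codes and reference_codes")
        prompt += [audio_token(c) for c in audio_codes]
        prompt.append(special_token(SEP))
        prompt += [audio_token(c) for c in reference_codes]
    prompt.append(special_token(SEP))
    return prompt
```

Under these assumptions, switching between TTS, ASR, and VC changes only the prompt, not the model: the same decoder would emit text ids after an "asr" prompt and audio ids after a "tts" or "vc" prompt.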

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Audio and Speech Processing

Lets computers understand and speak like people.

6 Oct 2025 0

90%

Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Audio and Speech Processing

Makes computer voices sound more human and real.

6 Aug 2025 1

89%

Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Audio and Speech Processing

Makes computer voices sound more like real people.

6 Aug 2025 0


Repos / Data Links

github.com

Page Count

14 pages
