Score: 3

UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens

Published: October 30, 2025 | arXiv ID: 2510.26372v1

By: Chengwei Liu, Haoyin Yan, Shaofei Xue, and more

BigTech Affiliations: Alibaba

Potential Business Impact:

Makes one AI create many kinds of sounds.

Business Areas:

Speech Recognition Data and Analytics, Software

Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features from the conditions to generate discrete tokens of the target audio in an autoregressive manner; 2) a special task identifier token unifies the different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec with acoustic and semantic branches is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance in comparison with state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified-audio.
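
The first two points of the abstract (a task identifier token plus continuous condition features driving autoregressive generation of discrete codec tokens) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper or its codebase: the TASK_TOKENS ids, CODEBOOK_SIZE, the EOS id, the generate function, and the toy_next_token stand-in that replaces a trained decoder.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical task identifier ids (illustrative only, not the paper's values).
TASK_TOKENS: Dict[str, int] = {
    "speech_restoration": 0,
    "target_speaker_extraction": 1,
    "speech_separation": 2,
    "voice_conversion": 3,
    "language_queried_separation": 4,
}
CODEBOOK_SIZE = 1024   # assumed codec vocabulary size
EOS = CODEBOOK_SIZE    # assumed end-of-sequence id, outside the codebook

# A decoder step maps (task id, condition features, tokens so far) to the next token.
NextToken = Callable[[int, Sequence[float], Sequence[int]], int]


def generate(task: str, condition: Sequence[float],
             next_token: NextToken, max_len: int) -> List[int]:
    """Autoregressively emit discrete tokens for one task.

    The task id selects the behaviour of a single shared decoder; the
    condition stands in for continuous features of the input audio.
    """
    if task not in TASK_TOKENS:
        raise ValueError(f"unknown task: {task}")
    task_id = TASK_TOKENS[task]
    tokens: List[int] = []
    for _ in range(max_len):
        tok = next_token(task_id, condition, tokens)
        if tok == EOS:
            break
        tokens.append(tok)  # each new token is fed back as context
    return tokens


def toy_next_token(task_id: int, condition: Sequence[float],
                   prefix: Sequence[int]) -> int:
    """Deterministic stand-in for a trained model (assumption).

    Emits one token per condition frame, mimicking a time-aligned task,
    then signals end of sequence.
    """
    step = len(prefix)
    if step >= len(condition):
        return EOS
    return (round(condition[step] * 100) + task_id) % CODEBOOK_SIZE


if __name__ == "__main__":
    print(generate("voice_conversion", [0.1, 0.2, 0.3], toy_next_token, 10))
```

In a real system the returned token ids would be passed to the codec decoder to reconstruct a waveform; that step is omitted here because the abstract gives no interface details for it.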

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

Computation and Language

Lets you edit spoken words with just your voice.

26 Oct 2025 2

89%

UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities

Sound

Makes computers create any sound from text.

29 Sep 2025 1

88%

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

CV and Pattern Recognition

Lets computers understand and create pictures.

6 Apr 2025 2

Country of Origin

🇨🇳 China

Page Count

21 pages
