UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens
By: Chengwei Liu, Haoyin Yan, Shaofei Xue and more
Potential Business Impact:
Makes one AI create many kinds of sounds.
Generative modeling has recently achieved remarkable success across text, image, and audio domains, demonstrating powerful capabilities for unified representation learning. However, audio generation models still face challenges in terms of audio quality and generalization ability across tasks. This fragmentation results in redundant development efforts, inconsistent performance, and limited extensibility. To address these issues, we propose UniTok-Audio, a scalable and extensible framework for unified audio generation tasks. Specifically, 1) UniTok-Audio extracts continuous features of the conditions to generate discrete tokens of the target audio in an autoregressive manner; 2) a special task identifier token unifies the different learning patterns of multiple tasks in a single framework; 3) a dual-stream audio codec with acoustic and semantic branches is developed for high-fidelity waveform reconstruction. Experimental results demonstrate that UniTok-Audio achieves competitive performance in comparison with state-of-the-art task-specific or multi-task systems across five time-aligned tasks: speech restoration, target speaker extraction, speech separation, voice conversion, and language-queried audio source separation. To foster future research, we will open-source our codebase. The demo page of our work can be found here: https://alibaba.github.io/unified-audio.
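As an illustration of points 1) and 2) in the abstract, the following minimal sketch shows one way a task identifier token could sit alongside continuous condition features and discrete target tokens in an autoregressive setup. It is not the authors' implementation: the codebook size, the placement of task identifiers after the codec vocabulary, and the names `task_token_id`, `build_example`, and `teacher_forcing_pairs` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The five tasks named in the abstract; the string keys are hypothetical.
TASKS = [
    "speech_restoration",
    "target_speaker_extraction",
    "speech_separation",
    "voice_conversion",
    "language_queried_separation",
]

CODEBOOK_SIZE = 1024  # assumed codec vocabulary size, not taken from the paper


def task_token_id(task: str) -> int:
    """Map a task name to an id placed after the codec vocabulary (assumed layout)."""
    return CODEBOOK_SIZE + TASKS.index(task)


@dataclass
class Example:
    task_id: int                  # special task identifier token
    condition: List[List[float]]  # continuous condition features, frames x dim
    targets: List[int]            # discrete codec tokens of the target audio


def build_example(task: str, condition: List[List[float]], targets: List[int]) -> Example:
    """Bundle one training example, checking that targets are valid codec tokens."""
    if any(not 0 <= t < CODEBOOK_SIZE for t in targets):
        raise ValueError("target token outside the codec vocabulary")
    return Example(task_token_id(task), condition, targets)


def teacher_forcing_pairs(ex: Example) -> List[Tuple[List[int], int]]:
    """Autoregressive factorization: each target token is predicted from the
    task token, the condition features, and all earlier target tokens.
    Only the target-token prefix is returned; the task token and condition
    are shared context for every step."""
    return [(ex.targets[:t], ex.targets[t]) for t in range(len(ex.targets))]


example = build_example("speech_separation", [[0.1, 0.2], [0.3, 0.4]], [5, 9, 2])
pairs = teacher_forcing_pairs(example)
```

In this sketch, switching tasks only changes `task_id`, while the condition features and target tokens keep the same layout, which is the sense in which a single token could unify several learning patterns.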
Similar Papers
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
Computation and Language
Lets you edit spoken words with just your voice.
UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities
Sound
Makes computers create any sound from text.
UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding
CV and Pattern Recognition
Lets computers understand and create pictures.