UniFlow-Audio: Unified Flow Matching for Audio Generation from Omni-Modalities

Published: September 29, 2025 | arXiv ID: 2509.24391v1

By: Xuenan Xu, Jiahao Mei, Zihao Zheng, and more

Potential Business Impact:

Makes computers create any sound from text.

Business Areas:

Audio Media and Entertainment, Music and Audio

Audio generation, including speech, music, and sound effects, has advanced rapidly in recent years. These tasks can be divided into two categories: time-aligned (TA) tasks, where each input unit corresponds to a specific segment of the output audio (e.g., phonemes aligned with frames in speech synthesis); and non-time-aligned (NTA) tasks, where such alignment is not available. Since modeling paradigms for the two types are typically different, research on different audio generation tasks has traditionally followed separate trajectories. However, audio is not inherently divided into such categories, making a unified model a natural and necessary goal for general audio generation. Previous unified audio generation works have adopted autoregressive architectures, while unified non-autoregressive approaches remain largely unexplored. In this work, we propose UniFlow-Audio, a universal audio generation framework based on flow matching. We propose a dual-fusion mechanism that temporally aligns audio latents with TA features and integrates NTA features via cross-attention in each model block. Task-balanced data sampling is employed to maintain strong performance across both TA and NTA tasks. UniFlow-Audio supports omni-modalities, including text, audio, and video. By leveraging the advantages of multi-task learning and the generative modeling capabilities of flow matching, UniFlow-Audio achieves strong results across 7 tasks using fewer than 8K hours of public training data and under 1B trainable parameters. Even the small variant with only ~200M trainable parameters shows competitive performance, highlighting UniFlow-Audio as a potential non-autoregressive foundation model for audio generation. Code and models will be available at https://wsntxxn.github.io/uniflow_audio.
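
The dual-fusion idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how TA features are combined with the latents, so frame-wise addition is assumed here, and the cross-attention is a single head with no learned projections. All function names (`fuse_time_aligned`, `cross_attend`, `dual_fusion_block`) are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_time_aligned(latents, ta_features):
    """Frame-wise fusion: TA vector t is added to latent frame t.

    Addition is an assumption; the abstract only says the two are
    temporally aligned.
    """
    if len(latents) != len(ta_features):
        raise ValueError("TA features need one vector per latent frame")
    return [
        [x + c for x, c in zip(frame, cond)]
        for frame, cond in zip(latents, ta_features)
    ]


def cross_attend(latents, nta_features):
    """Each latent frame attends over all NTA tokens.

    Simplified: one head, no query/key/value projections, plus a
    residual connection back to the frame.
    """
    dim = len(latents[0])
    fused = []
    for frame in latents:
        scores = [
            sum(q * k for q, k in zip(frame, token)) / math.sqrt(dim)
            for token in nta_features
        ]
        weights = softmax(scores)
        context = [
            sum(w * token[d] for w, token in zip(weights, nta_features))
            for d in range(dim)
        ]
        fused.append([x + c for x, c in zip(frame, context)])
    return fused


def dual_fusion_block(latents, ta_features, nta_features):
    """One block: time-aligned fusion first, then NTA cross-attention."""
    return cross_attend(fuse_time_aligned(latents, ta_features), nta_features)


# Toy inputs: 3 latent frames of width 2, 3 TA vectors, 2 NTA tokens.
latents = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
ta = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nta = [[2.0, 0.0], [0.0, 2.0]]
fused = dual_fusion_block(latents, ta, nta)
```

The sketch shows the structural difference between the two paths: TA features must match the latent sequence length frame for frame, while NTA features can have any number of tokens because attention pools over them.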

UniTok-Audio: A Unified Audio Generation Framework via Generative Modeling on Discrete Codec Tokens

Sound

Makes one AI create many kinds of sounds.

30 Oct 2025 3

88%

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

Audio and Speech Processing

Lets computers understand and speak like people.

6 Oct 2025 0

88%

MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation

Sound

Makes silent videos talk in one step.

8 Sep 2025 0

Page Count

19 pages
