PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description
By: Zihao Zheng, Zeyu Xie, Xuenan Xu, and more
Potential Business Impact:
Makes computers create realistic sounds from text.
Controllable text-to-audio generation (TTA) has attracted much attention recently. Although existing works achieve fine-grained controllability based on timestamp information, sound event categories are limited to a fixed set. Moreover, since only simulated data is used for training, generated audio quality and generalization to real data are limited. To tackle these issues, we propose PicoAudio2, which improves temporally controllable TTA via a new data processing pipeline and model architecture. Specifically, we use a grounding model to annotate event timestamps of real audio-text datasets, curating temporally strong real data in addition to the simulated data from existing works; the model is trained on the combination of both. Moreover, following PicoAudio, we encode timestamp information into a timestamp matrix that provides fine-grained, time-aligned information to the model on top of the coarse-grained textual description. Experiments show that PicoAudio2 exhibits superior performance in terms of temporal controllability and audio quality.
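To illustrate the timestamp-matrix idea, here is a minimal sketch of how per-event onset/offset annotations could be rasterized into a binary event-by-frame matrix. The frame rate, matrix layout, and function name are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def build_timestamp_matrix(events, duration_s, frame_rate=25):
    """Rasterize (label, onset, offset) annotations into a binary
    (num_events x num_frames) matrix: entry (i, t) is 1 when event i
    is active at frame t. Frame rate is an assumed hyperparameter."""
    num_frames = int(round(duration_s * frame_rate))
    matrix = np.zeros((len(events), num_frames), dtype=np.float32)
    for i, (_label, onset, offset) in enumerate(events):
        start = int(round(onset * frame_rate))
        end = int(round(offset * frame_rate))
        matrix[i, start:end] = 1.0  # mark frames where the event sounds
    return matrix

# Example: "a dog barks from 1.0s to 2.5s, then a doorbell rings at 3.0-3.5s"
events = [("dog barking", 1.0, 2.5), ("doorbell", 3.0, 3.5)]
m = build_timestamp_matrix(events, duration_s=5.0)
print(m.shape)  # (2, 125)
```

Such a matrix can be fed to the generator alongside the text description, giving the model an explicit time-aligned control signal rather than relying on timing phrases in the caption alone.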
Similar Papers
ControlAudio: Tackling Text-Guided, Timing-Indicated and Intelligible Audio Generation via Progressive Diffusion Modeling
Sound
Makes computers talk with perfect timing and clarity.
T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback
Sound
Makes AI create better sound stories from words.
InstructAudio: Unified speech and music generation with natural language instruction
Audio and Speech Processing
Makes computers create speech and music from words.