SAO-Instruct: Free-form Audio Editing using Natural Language Instructions
By: Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, and more
Potential Business Impact:
Edits any sound using plain written instructions.
Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require a complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
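To make the triplet idea concrete, the sketch below shows one way such (input audio, edit instruction, output audio) examples could be represented and serialized. This is a minimal illustration only: the `EditTriplet` class, the JSONL manifest format, the field names, and the example file paths are assumptions for clarity, not the authors' released dataset schema.

```python
from dataclasses import dataclass
from pathlib import Path
import json


@dataclass
class EditTriplet:
    """One training example: (input audio, edit instruction, output audio)."""
    input_audio: Path    # original clip
    instruction: str     # free-form edit, e.g. "make the barking sound more distant"
    output_audio: Path   # edited clip from Prompt-to-Prompt, DDPM inversion, or manual editing
    source: str          # which pipeline produced the pair: "p2p" | "ddpm_inversion" | "manual"


def write_manifest(triplets: list[EditTriplet], manifest_path: Path) -> None:
    """Serialize triplets to a JSONL manifest (an assumed, illustrative format)."""
    with manifest_path.open("w", encoding="utf-8") as f:
        for t in triplets:
            record = {
                "input_audio": str(t.input_audio),
                "instruction": t.instruction,
                "output_audio": str(t.output_audio),
                "source": t.source,
            }
            f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Hypothetical example entry; paths are placeholders.
    examples = [
        EditTriplet(
            input_audio=Path("clips/dog_bark.wav"),
            instruction="make the barking sound more distant",
            output_audio=Path("clips/dog_bark_edited.wav"),
            source="ddpm_inversion",
        ),
    ]
    write_manifest(examples, Path("sao_instruct_triplets.jsonl"))
```

A manifest like this pairs each source clip with a single natural-language instruction and its edited counterpart, which is the supervision signal the abstract describes; the actual released dataset and training code may organize these fields differently.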
Similar Papers
InstructAudio: Unified speech and music generation with natural language instruction
Audio and Speech Processing
Makes computers create speech and music from words.
Guiding Audio Editing with Audio Language Model
Sound
Lets you tell computers how to change sounds.
Step-Audio-EditX Technical Report
Computation and Language
Changes voice to sound happy or sad.