Neural Audio Codecs for Prompt-Driven Universal Sound Separation
By: Adhiraj Banerjee, Vipul Arora
Potential Business Impact:
Lets phones isolate any sound you describe in text.
Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to fixed-class separation. We introduce CodecSep, the first NAC-based model for on-device, universal, text-driven separation. CodecSep combines DAC compression with a Transformer masker modulated by CLAP-derived FiLM parameters. Across six open-domain benchmarks under matched training/prompt protocols, CodecSep surpasses AudioSep in separation fidelity (SI-SDR) while remaining competitive in perceptual quality (ViSQOL) and matching or exceeding fixed-stem baselines (TDANet, CodecFormer, SDCodec). In code-stream deployments it needs just 1.35 GMACs end-to-end, roughly 54x less compute (25x architecture-only) than spectrogram-domain separators such as AudioSep, while remaining fully bitstream-compatible.
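The abstract's conditioning mechanism, FiLM (feature-wise linear modulation), is simple: a text embedding (here, CLAP-derived) is projected to a per-channel scale and shift that modulate the masker's features. The sketch below is illustrative only; the dimensions, initialization, and function names are assumptions, not CodecSep's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(feats, cond, W, b):
    """FiLM: project a conditioning vector to per-channel scale (gamma)
    and shift (beta), then apply them across time to the features.
    feats: (batch, time, feat_dim); cond: (batch, cond_dim)."""
    gamma_beta = cond @ W + b                     # (batch, 2 * feat_dim)
    gamma, beta = np.split(gamma_beta, 2, axis=-1)
    return gamma[:, None, :] * feats + beta[:, None, :]

# Hypothetical sizes: 512-d text embedding, 256-d masker features.
cond_dim, feat_dim = 512, 256
W = rng.standard_normal((cond_dim, 2 * feat_dim)) * 0.02
b = np.zeros(2 * feat_dim)

x = rng.standard_normal((2, 100, feat_dim))  # masker features over 100 frames
c = rng.standard_normal((2, cond_dim))       # CLAP-style text embedding
y = film(x, c, W, b)
print(y.shape)  # (2, 100, 256)
```

Because the prompt enters only through gamma and beta, the same masker weights serve every text query, which is what makes the separator "universal" rather than fixed-stem.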
Similar Papers
Speech Enhancement Using Continuous Embeddings of Neural Audio Codec
Audio and Speech Processing
Cleans up noisy audio for clearer sound.
PromptSep: Generative Audio Separation via Multimodal Prompting
Sound
Lets you remove or pick sounds using voice.