Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance
By: Runwu Shi, Kai Li, Chang Li, and more
Potential Business Impact:
Separates individual speakers' voices from a single mixed recording.
Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixture-source data. While effective, such systems often rely on synthetic data pipelines that may not reflect real-world conditions. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning: they can capture local acoustic structure but produce temporally inconsistent speaker identities in the separated sources. To address this limitation, we propose Speaker-Embedding guidance, which, during the reverse diffusion process, maintains speaker coherence within each separated track while pushing the embeddings of different speakers further apart. In addition, we propose a new solver tailored for speech separation. Both strategies effectively improve performance on the challenging task of unsupervised source-model-based speech separation, as confirmed by extensive experimental results. Audio samples and code are available at https://runwushi.github.io/UnSepDiff_demo.
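To make the guidance idea concrete, below is a minimal sketch of how speaker-embedding guidance could be applied at one reverse-diffusion step. Everything here is an assumption for illustration: the `ToySpeakerEncoder` is a placeholder for a real pretrained speaker encoder, the function `speaker_guidance_grad`, the `push_weight` parameter, and the guidance scale are all hypothetical names, and the paper's actual guidance formulation and separation-oriented solver may differ.

```python
import torch
import torch.nn.functional as F

class ToySpeakerEncoder(torch.nn.Module):
    """Placeholder standing in for a real pretrained speaker encoder
    (e.g., a d-vector model); hypothetical, not the paper's encoder."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(1, dim)

    def forward(self, wav):
        # wav: (batch, samples) -> embedding of shape (batch, dim)
        return self.proj(wav.unsqueeze(-1)).mean(dim=1)

def speaker_guidance_grad(x1, x2, ref1, ref2, encoder, push_weight=1.0):
    """Gradient of a speaker-embedding guidance loss w.r.t. the two
    source estimates at one reverse-diffusion step.

    x1, x2:     current estimates of the two sources, (batch, samples)
    ref1, ref2: detached running reference embeddings, one per track
    encoder:    maps waveforms to speaker embeddings
    """
    x1 = x1.detach().requires_grad_(True)
    x2 = x2.detach().requires_grad_(True)

    e1 = F.normalize(encoder(x1), dim=-1)
    e2 = F.normalize(encoder(x2), dim=-1)

    # Coherence: keep each track close to its own reference embedding,
    # so speaker identity stays consistent across the reverse process.
    coherence = (1 - (e1 * ref1).sum(-1)) + (1 - (e2 * ref2).sum(-1))
    # Separation: push the two tracks' embeddings away from each other.
    separation = (e1 * e2).sum(-1)

    loss = (coherence + push_weight * separation).mean()
    return torch.autograd.grad(loss, (x1, x2))

# Usage inside a reverse-diffusion loop (all names hypothetical):
encoder = ToySpeakerEncoder()
x1_hat, x2_hat = torch.randn(2, 16000), torch.randn(2, 16000)
ref1 = F.normalize(torch.randn(2, 64), dim=-1)
ref2 = F.normalize(torch.randn(2, 64), dim=-1)
g1, g2 = speaker_guidance_grad(x1_hat, x2_hat, ref1, ref2, encoder)
x1_hat = x1_hat - 0.1 * g1   # nudge each estimate along the guidance
x2_hat = x2_hat - 0.1 * g2   # direction before the next denoising step
```

In a full system the reference embeddings would plausibly be updated as a running average of each track's embeddings over denoising steps, so the coherence term anchors each separated source to a single stable speaker identity rather than to a fixed target.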
Similar Papers
Unsupervised Single-Channel Audio Separation with Diffusion Source Priors
Audio and Speech Processing
Separates music into individual instruments without needing isolated recordings for training.
Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model
Sound
Separates singing voice from music quickly and efficiently.
Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
Audio and Speech Processing
Cleans up noisy audio to hear voices better.