Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance
By: Runwu Shi, Kai Li, Chang Li, and more
Potential Business Impact:
Separates individual speakers' voices from a single mixed recording.
Speech separation is a fundamental task in audio processing, typically addressed with fully supervised systems trained on paired mixture-source data. While effective, such systems often rely on synthetic data pipelines that may not reflect real-world conditions. Instead, we revisit the source-model paradigm, training a diffusion generative model solely on anechoic speech and formulating separation as a diffusion inverse problem. However, unconditional diffusion models lack speaker-level conditioning: they can capture local acoustic structure but produce temporally inconsistent speaker identities in the separated sources. To address this limitation, we propose Speaker-Embedding guidance, which, during the reverse diffusion process, maintains speaker coherence within each separated track while pushing the embeddings of different speakers further apart. In addition, we propose a new solver tailored for speech separation. Both strategies effectively improve performance on the challenging task of unsupervised source-model-based speech separation, as confirmed by extensive experimental results. Audio samples and code are available at https://runwushi.github.io/UnSepDiff_demo.
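To make the guidance idea concrete, below is a minimal sketch of how speaker-embedding guidance could be applied at one reverse-diffusion step. Everything here is an assumption for illustration: the `ToySpeakerEncoder` is a placeholder for a real pretrained speaker encoder, the function `speaker_guidance_grad`, the `push_weight` parameter, and the guidance scale are all hypothetical names, and the paper's actual guidance formulation and separation-oriented solver may differ.

```python
import torch
import torch.nn.functional as F

class ToySpeakerEncoder(torch.nn.Module):
    """Placeholder standing in for a real pretrained speaker encoder
    (e.g., a d-vector model); hypothetical, not the paper's encoder."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(1, dim)

    def forward(self, wav):
        # wav: (batch, samples) -> embedding of shape (batch, dim)
        return self.proj(wav.unsqueeze(-1)).mean(dim=1)

def speaker_guidance_grad(x1, x2, ref1, ref2, encoder, push_weight=1.0):
    """Gradient of a speaker-embedding guidance loss w.r.t. the two
    source estimates at one reverse-diffusion step.

    x1, x2:     current estimates of the two sources, (batch, samples)
    ref1, ref2: detached running reference embeddings, one per track
    encoder:    maps waveforms to speaker embeddings
    """
    x1 = x1.detach().requires_grad_(True)
    x2 = x2.detach().requires_grad_(True)

    e1 = F.normalize(encoder(x1), dim=-1)
    e2 = F.normalize(encoder(x2), dim=-1)

    # Coherence: keep each track close to its own reference embedding,
    # so speaker identity stays consistent across the reverse process.
    coherence = (1 - (e1 * ref1).sum(-1)) + (1 - (e2 * ref2).sum(-1))
    # Separation: push the two tracks' embeddings away from each other.
    separation = (e1 * e2).sum(-1)

    loss = (coherence + push_weight * separation).mean()
    return torch.autograd.grad(loss, (x1, x2))

# Usage inside a reverse-diffusion loop (all names hypothetical):
encoder = ToySpeakerEncoder()
x1_hat, x2_hat = torch.randn(2, 16000), torch.randn(2, 16000)
ref1 = F.normalize(torch.randn(2, 64), dim=-1)
ref2 = F.normalize(torch.randn(2, 64), dim=-1)
g1, g2 = speaker_guidance_grad(x1_hat, x2_hat, ref1, ref2, encoder)
x1_hat = x1_hat - 0.1 * g1   # nudge each estimate along the guidance
x2_hat = x2_hat - 0.1 * g2   # direction before the next denoising step
```

In a full system the reference embeddings would plausibly be updated as a running average of each track's embeddings over denoising steps, so the coherence term anchors each separated source to a single stable speaker identity rather than to a fixed target.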
Similar Papers
Unsupervised Single-Channel Audio Separation with Diffusion Source Priors
Audio and Speech Processing
Separates music into individual instruments without needing isolated recordings for training.
Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model
Sound
Separates singing voice from music quickly and efficiently.
Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
Audio and Speech Processing
Cleans up noisy audio to hear voices better.