Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

Published: November 26, 2025 | arXiv ID: 2511.21342v1

By: Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, and more

Potential Business Impact:

Cleans up music to hear just the singer.

Business Areas:

Speech Recognition, Data and Analytics, Software

Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.
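The quality-efficiency trade-off mentioned in the abstract can be illustrated with a minimal sketch of mixture-conditioned diffusion sampling. This is not the authors' sampler or model. It is a generic deterministic Euler sampler over a noise schedule, and the trained network is replaced by a closed-form toy denoiser. The names `toy_denoiser` and `sample_vocals`, the assumed Gaussian relation between vocals and mixture, and all parameter values are hypothetical and chosen only so the example runs on its own.

```python
import numpy as np


def toy_denoiser(x, sigma, mixture, prior_std=0.1):
    """Hypothetical stand-in for a trained network D(x, sigma | mixture).

    Assumption for illustration only: vocals given the mixture follow
    N(0.5 * mixture, prior_std**2). Under that assumption the optimal
    denoiser has the closed form below. A real system would use a
    neural network here.
    """
    mean = 0.5 * mixture
    return (prior_std**2 * x + sigma**2 * mean) / (prior_std**2 + sigma**2)


def sample_vocals(mixture, num_steps, sigma_max=1.0, sigma_min=0.01,
                  seed=0, denoiser=toy_denoiser):
    """Generate a vocal estimate conditioned on `mixture`.

    `num_steps` is the user-facing knob: more steps mean more denoiser
    evaluations and a more accurate integration of the sampling ODE.
    """
    rng = np.random.default_rng(seed)
    # Geometric noise schedule from sigma_max down to sigma_min, then 0.
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, num_steps), 0.0)
    # Start from pure noise at the highest noise level.
    x = sigma_max * rng.standard_normal(mixture.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Euler step on dx/dsigma = (x - D(x, sigma | mixture)) / sigma.
        direction = (x - denoiser(x, sigma, mixture)) / sigma
        x = x + (sigma_next - sigma) * direction
    return x


# Usage: same mixture and seed, different step budgets.
mixture = np.sin(np.linspace(0.0, 1.0, 64))  # placeholder signal, not audio
coarse = sample_vocals(mixture, num_steps=5)
fine = sample_vocals(mixture, num_steps=50)
```

In this toy setting the 50-step result lies closer to the exact solution of the sampling ODE than the 5-step result, at ten times the number of denoiser evaluations. How this carries over to perceived audio quality in a real separation model is a matter for the paper's ablation study, not something this sketch shows.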

Efficient and Fast Generative-Based Singing Voice Separation using a Latent Diffusion Model

Sound

Separates singing voice from music perfectly.

25 Nov 2025 1

90%

Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

Audio and Speech Processing

Separates voices from mixed sounds using AI.

29 Sep 2025 0

89%

Unsupervised Single-Channel Audio Separation with Diffusion Source Priors

Audio and Speech Processing

Separates sounds from recordings without needing perfect examples.

8 Dec 2025 0


Repos / Data Links

github.com

Page Count

5 pages
