Diffusion-Based Unsupervised Audio-Visual Speech Separation in Noisy Environments with Noise Prior
By: Yochai Yemini, Rami Ben-Ari, Sharon Gannot, and more
Potential Business Impact:
Cleans up noisy audio to hear voices better.
In this paper, we address the problem of single-microphone speech separation in the presence of ambient noise. We propose an unsupervised generative technique that directly models both the clean speech and the structured noise components, training exclusively on these individual signals rather than on noisy mixtures. Our approach leverages an audio-visual score model that incorporates visual cues to serve as a strong generative speech prior. By explicitly modelling the noise distribution alongside the speech distribution, we cast separation as an inverse problem and enable an effective decomposition of the mixture. We perform speech separation by sampling from the posterior distributions via a reverse diffusion process, which directly estimates and removes the modelled noise component to recover the clean constituent signals. Experimental results demonstrate promising performance, highlighting the effectiveness of our direct noise modelling in challenging acoustic environments.
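To make the posterior-sampling step concrete, here is a minimal PyTorch sketch of one common recipe for this kind of separation: diffusion-posterior-sampling-style guidance under a variance-exploding diffusion, assuming an additive mixture y = s + n with one speech source and one noise source. The schedule endpoints, the step count, and the handles `speech_score` and `noise_score` are hypothetical placeholders for learned prior score models; this illustrates the general technique, not the authors' implementation.

```python
import torch

def ve_sigma(t, sigma_min=1e-2, sigma_max=10.0):
    # Variance-exploding noise schedule sigma(t) = sigma_min * (sigma_max / sigma_min)^t.
    # The endpoints are illustrative, not values from the paper.
    return sigma_min * (sigma_max / sigma_min) ** t

def separate(y, video, speech_score, noise_score, steps=200, guidance=0.5):
    """Posterior sampling for the additive mixture y = s + n.

    `speech_score(s_t, video, sigma)` and `noise_score(n_t, sigma)` are
    hypothetical handles to the learned audio-visual speech prior and the
    noise prior; each is assumed to return the score (gradient of the
    log-density) of its prior at noise level sigma.
    """
    ts = torch.linspace(1.0, 0.0, steps + 1)
    s = torch.randn_like(y) * ve_sigma(ts[0])   # initialise speech chain from the prior
    n = torch.randn_like(y) * ve_sigma(ts[0])   # initialise noise chain from the prior
    for i in range(steps):
        sig, sig_next = ve_sigma(ts[i]), ve_sigma(ts[i + 1])
        s = s.detach().requires_grad_(True)
        n = n.detach().requires_grad_(True)
        s_score = speech_score(s, video, sig)   # prior score for speech (uses visual cues)
        n_score = noise_score(n, sig)           # prior score for noise
        # Tweedie estimates of the clean sources at the current noise level.
        s0 = s + sig ** 2 * s_score
        n0 = n + sig ** 2 * n_score
        # Mixture-consistency (likelihood) term: the denoised sources
        # should add up to the observed mixture.
        residual = ((y - (s0 + n0)) ** 2).sum()
        g_s, g_n = torch.autograd.grad(residual, (s, n))
        with torch.no_grad():
            step = sig ** 2 - sig_next ** 2
            # Reverse-diffusion predictor step plus guidance toward mixture consistency.
            s = s + step * s_score - guidance * g_s + step.sqrt() * torch.randn_like(s)
            n = n + step * n_score - guidance * g_n + step.sqrt() * torch.randn_like(n)
    return s.detach(), n.detach()
```

In practice, score networks of this kind typically operate on a time-frequency representation rather than raw waveforms, and separating multiple speakers would simply add one sampling chain (with its own prior score) per source.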
Similar Papers
Unsupervised Single-Channel Audio Separation with Diffusion Source Priors
Audio and Speech Processing
Separates music into individual instruments without needing original recordings.
Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance
Audio and Speech Processing
Separates voices from mixed sounds using AI.
Unsupervised Speech Enhancement using Data-defined Priors
Audio and Speech Processing
Cleans up noisy voices without needing perfect examples.