Score: 2

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

Published: August 8, 2025 | arXiv ID: 2508.06393v1

By: Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos and more

BigTech Affiliations: Samsung

Potential Business Impact:

Lets computers separate voices in noisy rooms.

Traditional speech separation and speaker diarization approaches rely on prior knowledge of the target speakers or on a predetermined number of participants in the audio signal. To address these limitations, recent advances focus on enrollment-free methods that identify targets without explicit speaker labeling. This work introduces a new approach that trains speech separation and diarization simultaneously by automatically identifying target speaker embeddings within mixtures. The proposed model employs a dual-stage training pipeline designed to learn speaker representation features that are resilient to background noise interference. The authors also present an overlapping spectral loss function tailored to improving diarization accuracy during overlapped speech frames. Experimental results show significant performance gains over the current state-of-the-art (SOTA) baseline, with a 71% relative improvement in DER and 69% in cpWER.
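The abstract reports its gains as relative improvements in diarization error rate (DER). As a reading aid, the sketch below shows the standard DER definition (missed speech, false alarm, and speaker confusion time divided by total reference speech time) and how a relative improvement is computed from a baseline and a proposed score. The function names and the numbers in the usage lines are illustrative assumptions, not values or code from the paper.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           confusion: float, total_speech: float) -> float:
    """Standard DER: error time over total reference speech time.

    All arguments are durations in the same unit (e.g. seconds).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


def relative_improvement(baseline: float, proposed: float) -> float:
    """Fraction by which an error metric drops relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - proposed) / baseline


# Hypothetical numbers chosen only to illustrate the arithmetic.
example_der = diarization_error_rate(missed=3.0, false_alarm=2.0,
                                     confusion=5.0, total_speech=100.0)
example_gain = relative_improvement(baseline=0.20, proposed=0.058)
print(f"DER = {example_der:.2%}, relative improvement = {example_gain:.0%}")
```

Note that a relative improvement says nothing about the absolute error level: a 71% relative reduction could mean a drop from 20% to 5.8% DER, as in the made-up example, or from 2% to 0.58%.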

Spatio-spectral diarization of meetings by combining TDOA-based segmentation and speaker embedding-based clustering

Audio and Speech Processing

Tells who is speaking, even with many voices.

19 Jun 2025

89%

Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation

Audio and Speech Processing

Makes computers better at telling speakers apart.

18 Sep 2025

89%

Unsupervised Single-Channel Speech Separation with a Diffusion Prior under Speaker-Embedding Guidance

Audio and Speech Processing

Separates voices from mixed sounds using AI.

29 Sep 2025


Country of Origin

🇰🇷 South Korea

Page Count

5 pages
