Spatio-spectral diarization of meetings by combining TDOA-based segmentation and speaker embedding-based clustering
By: Tobias Cord-Landwehr, Tobias Gburrek, Marc Deegen and more
Potential Business Impact:
Tells who is speaking, even with many voices.
We propose a spatio-spectral, combined model-based and data-driven diarization pipeline consisting of TDOA-based segmentation followed by embedding-based clustering. The proposed system requires neither access to multi-channel training data nor prior knowledge about the number or placement of microphones. It works for both a compact microphone array and distributed microphones, with minor adjustments. Due to its superior handling of overlapping speech during segmentation, the proposed pipeline significantly outperforms the single-channel pyannote approach, both in a scenario with a compact microphone array and in a setup with distributed microphones. Additionally, we show that, unlike fully spatial diarization pipelines, the proposed system can correctly track speakers when they change positions.
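To make the TDOA idea concrete, the following is a minimal sketch of estimating the time difference of arrival between two microphone signals. The abstract does not say which TDOA estimator the authors use. GCC-PHAT is assumed here only because it is a common choice, and the function name `gcc_phat_tdoa` and its parameters are illustrative, not taken from the paper.

```python
import numpy as np


def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x, in seconds, with GCC-PHAT.

    Illustrative sketch only: the estimator choice is an assumption,
    not the method described in the paper.
    """
    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)

    # Cross-power spectrum with phase transform (PHAT) weighting:
    # keep only the phase, discard the magnitude.
    cross = Y * np.conj(X)
    cross /= np.abs(cross) + 1e-12

    # Back to the lag domain.
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if requested.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs
```

A positive return value means the second signal lags the first. In a diarization setting, such per-frame estimates could be grouped over time so that frames with a consistent TDOA form one segment, which is the general intuition behind spatial segmentation.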
Similar Papers
Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling
Sound
Lets computers separate voices in noisy rooms.
Spatial Audio Processing with Large Language Model on Wearable Devices
Sound
Listens to where sounds come from.
Multi-Stage Speaker Diarization for Noisy Classrooms
Sound
Helps computers know who spoke in noisy classrooms.