Mitigating Intra-Speaker Variability in Diarization with Style-Controllable Speech Augmentation
By: Miseul Kim, Soo Jin Park, Kyungguen Byun, and more
Potential Business Impact:
Makes computers better at telling speakers apart.
Speaker diarization systems often struggle with high intrinsic intra-speaker variability, such as shifts in emotion, health, or content. This can cause segments from the same speaker to be misclassified as different individuals, for example, when a speaker raises their voice or talks faster during a conversation. To address this, we propose a style-controllable speech generation model that augments speech across diverse styles while preserving the target speaker's identity. The proposed system starts with diarized segments from a conventional diarizer. For each diarized segment, it generates augmented speech samples enriched with phonetic and stylistic diversity. Speaker embeddings from the original and generated audio are then blended to improve the system's robustness when grouping segments with high intrinsic intra-speaker variability. We validate our approach on a simulated emotional speech dataset and the truncated AMI dataset, demonstrating significant improvements with error rate reductions of 49% and 35%, respectively.
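The abstract describes blending speaker embeddings from original and style-augmented audio before re-grouping segments. The sketch below illustrates one plausible reading of that step; the helper names (extract_embedding, generate_style_variants), the blend weight alpha, and the clustering choice are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch of the embedding-blending idea, assuming hypothetical helpers
# extract_embedding(audio) -> np.ndarray and generate_style_variants(audio) -> list of audio.
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def blended_embedding(segment_audio, extract_embedding, generate_style_variants, alpha=0.5):
    """Blend a segment's original speaker embedding with the mean embedding
    of its style-augmented copies (alpha is an assumed mixing weight)."""
    original = extract_embedding(segment_audio)
    variants = [extract_embedding(a) for a in generate_style_variants(segment_audio)]
    augmented_mean = np.mean(variants, axis=0) if variants else original
    blended = alpha * original + (1.0 - alpha) * augmented_mean
    # Length-normalize so cosine distances in clustering behave consistently.
    return blended / np.linalg.norm(blended)


def regroup_segments(segments, extract_embedding, generate_style_variants, n_speakers):
    """Re-cluster diarized segments using the blended embeddings."""
    embs = np.stack([
        blended_embedding(s, extract_embedding, generate_style_variants)
        for s in segments
    ])
    clustering = AgglomerativeClustering(
        n_clusters=n_speakers, metric="cosine", linkage="average"
    )
    return clustering.fit_predict(embs)
```

The intuition is that averaging over stylistically diverse renditions of the same segment pulls embeddings of the same speaker closer together, so segments affected by emotion or speaking-rate shifts are less likely to split into separate clusters.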
Similar Papers
Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling
Sound
Lets computers separate voices in noisy rooms.
StyleSpeaker: Audio-Enhanced Fine-Grained Style Modeling for Speech-Driven 3D Facial Animation
Multimedia
Makes talking faces move realistically for any person.
Emotional Text-To-Speech Based on Mutual-Information-Guided Emotion-Timbre Disentanglement
Sound
Makes computer voices sound more real and emotional.