Domain Adaptation of the Pyannote Diarization Pipeline for Conversational Indonesian Audio
By: Muhammad Daffa'i Rafi Prasetyo, Ramadhan Andika Putra, Zaidan Naufal Ilmi and more
Potential Business Impact:
Helps computers tell who is speaking in Indonesian.
This study presents a domain adaptation approach for speaker diarization of conversational Indonesian audio. We address the challenge of adapting an English-centric diarization pipeline to a low-resource language by generating synthetic training data with neural Text-to-Speech technology. Experiments cover two training configurations: a small dataset (171 samples) and a large dataset containing 25 hours of synthetic speech. The baseline pyannote/segmentation-3.0 model, trained on the AMI Corpus, achieves a Diarization Error Rate (DER) of 53.47% when applied zero-shot to Indonesian. Domain adaptation substantially improves performance: the small-dataset models reduce DER to 34.31% (1 epoch) and 34.81% (2 epochs). The model trained on the 25-hour dataset performs best, with a DER of 29.24%, reported as a 13.68% absolute improvement over the baseline, while maintaining 99.06% Recall and 87.14% F1-Score.
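For readers unfamiliar with the metric, DER is conventionally defined as the sum of false-alarm, missed-speech, and speaker-confusion durations divided by the total reference speech duration. The sketch below illustrates that standard definition only; the function name and the example durations are hypothetical and are not taken from the study, which may apply its own scoring settings (for example, a forgiveness collar or overlap handling).

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Return DER as a fraction, following the conventional definition.

    All arguments are durations in seconds:
      false_alarm  - non-speech labelled as speech
      missed       - reference speech the system did not detect
      confusion    - speech attributed to the wrong speaker
      total_speech - total reference speech duration
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical durations, chosen only to illustrate the arithmetic.
der = diarization_error_rate(
    false_alarm=12.0, missed=6.0, confusion=18.0, total_speech=120.0
)
print(f"DER = {der:.2%}")  # DER = 30.00%
```

Because false alarms count toward the numerator, DER can exceed 100% when a system hallucinates a lot of speech, so it is best read as an error ratio rather than a bounded percentage.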