Linguistically Informed Tokenization Improves ASR for Underresourced Languages
By: Massimo Daul, Alessio Tosolini, Claire Bowern
Potential Business Impact:
Helps save old languages with smart computers.
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves word error rate (WER) and character error rate (CER) compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
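As a rough sketch of the two tokenization schemes being compared (not the authors' actual code, and using a hypothetical phoneme inventory and example string rather than real Yan-nhangu data), the snippet below contrasts a per-character orthographic tokenizer with a greedy longest-match phonemic tokenizer in which digraphs and long vowels become single labels:

```python
# A minimal sketch (not the authors' implementation) contrasting the baseline
# orthographic tokenization with a linguistically informed phonemic scheme.
# The phoneme inventory and example transcription below are hypothetical
# placeholders, not actual Yan-nhangu data.

def orthographic_tokenize(text: str) -> list[str]:
    """Baseline: every character (minus spaces) is its own CTC label."""
    return [ch for ch in text if ch != " "]

# Hypothetical inventory: digraphs and long vowels are single symbols, so the
# model predicts one label per phoneme rather than one per letter.
PHONEMES = ["nh", "dh", "rr", "ng", "aa", "a", "i", "u", "d", "g", "m", "n", "r", "y"]

def phonemic_tokenize(text: str, inventory=PHONEMES) -> list[str]:
    """Greedy longest-match segmentation into phoneme-sized CTC labels."""
    symbols = sorted(inventory, key=len, reverse=True)  # try digraphs first
    text = text.replace(" ", "")
    tokens, i = [], 0
    while i < len(text):
        for sym in symbols:
            if text.startswith(sym, i):
                tokens.append(sym)
                i += len(sym)
                break
        else:
            tokens.append("[UNK]")  # character outside the inventory
            i += 1
    return tokens

if __name__ == "__main__":
    sample = "nhangu yanda"  # hypothetical transcription
    print(orthographic_tokenize(sample))  # 11 single-character labels
    print(phonemic_tokenize(sample))      # 9 phoneme-sized labels: ['nh', 'a', 'ng', ...]
```

The phonemic label set produced this way is smaller and linguistically coherent; either label set, plus padding, unknown, and word-delimiter symbols, would then serve as the CTC output vocabulary when fine-tuning a wav2vec2 model (for example via a `Wav2Vec2ForCTC`-style setup).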
Similar Papers
Morphologically-Informed Tokenizers for Languages with Non-Concatenative Morphology: A case study of Yoloxóchitl Mixtec ASR
Computation and Language
Helps computers understand languages with tricky word parts.
How I Built ASR for Endangered Languages with a Spoken Dictionary
Computation and Language
Helps save dying languages with less speech data.
Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data
Computation and Language
Lets computers understand rare languages better.