How Far Do SSL Speech Models Listen for Tone? Temporal Focus of Tone Representation under Low-resource Transfer
By: Minu Kim, Ji Sub Um, Hoirin Kim
Potential Business Impact:
Helps computers understand speech in languages where tone changes word meaning.
Lexical tone is central to many languages but remains underexplored in self-supervised learning (SSL) speech models, especially beyond Mandarin. We study four languages with complex and diverse tone systems (Burmese, Thai, Lao, and Vietnamese) to examine how far such models listen for tone and how transfer operates in low-resource conditions. As a baseline reference, we estimate the temporal span of tone cues to be about 100 ms in Burmese and Thai, and about 180 ms in Lao and Vietnamese. Probes and gradient analyses on fine-tuned SSL models reveal that tone transfer varies by downstream task: automatic speech recognition (ASR) fine-tuning aligns spans with language-specific tone cues, while prosody- and voice-related tasks bias the model toward overly long spans. These findings indicate that tone transfer is shaped by the downstream task, highlighting how task choice affects the temporal focus of tone modeling.
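The gradient analysis the abstract mentions suggests a simple way to measure how far a model "listens" around a given frame: backpropagate a tone probe's logit to the input waveform and measure the time span over which the gradient is non-negligible. The sketch below illustrates that idea only, not the authors' exact procedure; the wav2vec 2.0 checkpoint, the untrained linear probe, and the 1% relative saliency threshold are all illustrative assumptions.

```python
# A minimal sketch of a gradient-based temporal-span estimate, assuming a
# wav2vec 2.0 encoder from HuggingFace transformers. The linear tone probe
# (untrained here) and the 1% saliency threshold are illustrative choices,
# not the paper's actual setup.
import torch
from transformers import Wav2Vec2Model

SR = 16_000  # sample rate expected by the checkpoint
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

NUM_TONES = 5  # e.g., Thai distinguishes five lexical tones
probe = torch.nn.Linear(model.config.hidden_size, NUM_TONES)  # hypothetical probe

def tone_span_ms(wave: torch.Tensor, frame_idx: int, tone_id: int,
                 rel_thresh: float = 0.01) -> float:
    """Span (ms) of input samples whose gradient magnitude, taken w.r.t.
    one probed frame's tone logit, exceeds rel_thresh * max saliency."""
    wave = wave.clone().requires_grad_(True)
    hidden = model(wave.unsqueeze(0)).last_hidden_state  # (1, frames, hidden)
    probe(hidden[0, frame_idx])[tone_id].backward()      # scalar logit -> grads
    saliency = wave.grad.abs()
    hot = (saliency > rel_thresh * saliency.max()).nonzero().squeeze(-1)
    if hot.numel() == 0:
        return 0.0
    return (hot.max() - hot.min() + 1).item() / SR * 1000.0

# One second of audio (random stand-in); wav2vec 2.0 yields ~49 frames here,
# so frame 24 sits near the middle of the utterance.
wave = torch.randn(SR)
print(f"estimated span: {tone_span_ms(wave, frame_idx=24, tone_id=0):.0f} ms")
```

To mirror the paper's analysis, one would train the probe on labeled tones and average spans over frames and utterances before comparing against the roughly 100 ms (Burmese, Thai) and 180 ms (Lao, Vietnamese) cue lengths reported above.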
Similar Papers
LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning
Sound
Makes computers talk like people by reading lips.
Tone recognition in low-resource languages of North-East India: peeling the layers of SSL-based speech models
Audio and Speech Processing
Helps computers understand spoken words in rare languages.
Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages
Sound
Helps computers understand speech in low-resource languages such as Thai.