Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis
By: Yizhong Geng, Jizhuo Xu, Zeyu Liang and more
Potential Business Impact:
Makes computers speak any language, even rare ones.
Text-to-speech (TTS) technology has achieved impressive results for widely spoken languages, yet many under-resourced languages remain challenged by limited data and linguistic complexities. In this paper, we present a novel methodology that integrates a data-optimized framework with an advanced acoustic model to build high-quality TTS systems for low-resource scenarios. We demonstrate the effectiveness of our approach using Thai as an illustrative case, where intricate phonetic rules and sparse resources are effectively addressed. Our method enables zero-shot voice cloning and improved performance across diverse client applications, ranging from finance to healthcare, education, and law. Extensive evaluations, both subjective and objective, confirm that our model meets state-of-the-art standards, offering a scalable solution for TTS production in data-limited settings, with significant implications for broader industry adoption and multilingual accessibility.
Similar Papers
Optimizing Multilingual Text-To-Speech with Accents & Emotions
Machine Learning (CS)
Makes computers speak with Indian accents and feelings.
Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation
Sound
Makes one voice talk in many languages.
Zero-Shot Text-to-Speech for Vietnamese
Computation and Language
Makes computers speak Vietnamese like people.