
ManchuTTS: Towards High-Quality Manchu Speech Synthesis via Flow Matching and Hierarchical Text Representation

Published: December 27, 2025 | arXiv ID: 2512.22491v1

By: Suhua Wang, Zifan Wang, Xiaoxin Sun and more

Potential Business Impact:

Makes computers speak the endangered Manchu language.

Business Areas:

Translation Service, Professional Services

As an endangered language, Manchu presents unique challenges for speech synthesis, including severe data scarcity and strong phonological agglutination. This paper proposes ManchuTTS (Manchu Text to Speech), a novel approach tailored to Manchu's linguistic characteristics. To handle agglutination, the method uses a three-tier text representation (phoneme, syllable, prosodic) and a cross-modal hierarchical attention mechanism for multi-granular alignment. The synthesis model integrates deep convolutional networks with a flow-matching Transformer, enabling efficient, non-autoregressive generation. The method further introduces a hierarchical contrastive loss to guide structured acoustic-linguistic correspondence. To address low-resource constraints, the authors construct the first Manchu TTS dataset and employ a data augmentation strategy. Experiments demonstrate that ManchuTTS attains a mean opinion score (MOS) of 4.52 using a 5.2-hour training subset derived from the full 6.24-hour annotated corpus, outperforming all baseline models by a notable margin. Ablations confirm that hierarchical guidance improves agglutinative word pronunciation accuracy (AWPA) by 31% and prosodic naturalness by 27%.
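
The abstract names flow matching as the generative objective but does not spell it out. As a rough illustration only, the sketch below shows one common variant, conditional flow matching with a straight-line path from Gaussian noise to the target features. The path choice, the function names (cfm_training_pair, cfm_loss, euler_sample), and the fixed-step Euler sampler are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one conditional flow-matching training example.

    x1: target features (for instance mel-spectrogram frames), shape (batch, dim).
    Returns the interpolated point x_t, the sampled time t, and the
    target velocity u that a model would be trained to predict.
    Assumes a straight-line path from noise x0 to data x1 (hypothetical choice).
    """
    x0 = rng.standard_normal(x1.shape)        # noise sample at t = 0
    t = rng.uniform(size=(x1.shape[0], 1))    # one time value per batch item
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight path
    u = x1 - x0                               # constant velocity along that path
    return x_t, t, u

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, x0, num_steps=16):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps.

    All frames are updated in parallel at each step, which is what makes
    this style of generation non-autoregressive.
    """
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In a real system velocity_fn would be a trained network conditioned on the text representation; here it is left as a plain callable so the sketch stays independent of any particular model.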

Text to Speech System for Meitei Mayek Script

Computation and Language

Lets computers speak the Manipuri language.

9 Aug 2025 1

87%

TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

Computation and Language

Makes computers speak all Tibetan dialects.

22 Sep 2025 0

87%

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

Sound

Makes voices sound like anyone, in any language.

18 Sep 2025 0


Country of Origin

🇨🇳 China

Page Count

7 pages
