Score: 2

WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing

Published: September 22, 2025 | arXiv ID: 2509.18004v1

By: Yuhang Dai, Ziyu Zhang, Shuai Wang and more

Potential Business Impact:

Helps computers understand Chinese dialects better.

Business Areas:

Semantic Web Internet Services

The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus's effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and demonstrate results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also plays a crucial role in promoting AI equity and mitigating bias in speech technologies. The corpus, benchmarks, models, and recipes are publicly available on our project page.

WenetSpeech-Wu: Datasets, Benchmarks, and Models for a Unified Chinese Wu Dialect Speech Processing Ecosystem

Sound

Helps computers understand a rare Chinese language.

16 Jan 2026 2

93%

WenetSpeech-Yue: A Large-scale Cantonese Speech Corpus with Multi-dimensional Annotation

Sound

Makes computers understand and speak Cantonese better.

4 Sep 2025 3

93%

Page Count

5 pages
