Towards Comprehensive Semantic Speech Embeddings for Chinese Dialects

Published: January 12, 2026 | arXiv ID: 2601.07274v1

By: Kalvin Chang, Yiwen Shao, Jiahong Li, and more

Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (speech large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations that are semantically aligned between Chinese dialects and Mandarin. In this paper, we achieve this cross-dialect semantic alignment by training a speech encoder using only automatic speech recognition (ASR) data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at https://github.com/kalvinchang/yubao.
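
To make the speech-to-speech retrieval evaluation concrete, here is a minimal sketch of how such a retrieval score could be computed. It is an illustration under stated assumptions, not the paper's evaluation code: the abstract does not name a metric, so cosine similarity and recall@k are assumptions, as are the function names `cosine` and `recall_at_k` and the toy three-dimensional vectors, which stand in for utterance-level embeddings from a speech encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(queries, candidates, k=1):
    """Fraction of queries whose paired candidate ranks in the top k.

    Assumption: queries[i] (a dialect utterance embedding) and
    candidates[i] (a Mandarin utterance embedding) carry the same meaning.
    """
    hits = 0
    for i, query in enumerate(queries):
        scores = [cosine(query, cand) for cand in candidates]
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(candidates)),
                        key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings; real ones would come from the speech encoder.
dialect = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
mandarin = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]

print(recall_at_k(dialect, mandarin, k=1))
```

If the embeddings are semantically aligned across varieties, each dialect utterance should retrieve its Mandarin counterpart ahead of unrelated utterances, so a higher recall@k would indicate better alignment.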

Leveraging LLM and Self-Supervised Training Models for Speech Recognition in Chinese Dialects: A Comparative Analysis

Computation and Language

Helps computers understand Chinese accents and dialects.

27 May 2025 2

90%

VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context

Computation and Language

Tests how well computers understand spoken Chinese.

11 Nov 2025 2

88%

C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Computation and Language

Helps voice assistants understand talking better.

30 Jul 2025 0
