VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
By: Heyang Liu, Ziyang Cheng, Yuhao Wang, and more
Potential Business Impact:
Tests how well AI systems hold spoken conversations in Mandarin Chinese.
The development of multi-modal large language models (LLMs) has led to intelligent systems capable of speech interaction. As one of the most widely spoken languages in the world, Mandarin is supported by most models to broaden their applicability and reach. However, the scarcity of comprehensive speech-to-speech (S2S) benchmarks for the Mandarin context impedes systematic evaluation for developers and hinders fair model comparison for users. In this work, we propose VocalBench-zh, an evaluation suite divided by ability level and adapted to the Mandarin context, consisting of 10 well-crafted subsets and over 10K high-quality instances covering 12 user-oriented characteristics. Evaluation of 14 mainstream models reveals challenges common to current approaches and highlights the need for new insights into next-generation speech interactive systems. The evaluation code and datasets will be available at https://github.com/SJTU-OmniAgent/VocalBench-zh.
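Since the suite groups its 10 subsets by the ability each one probes, ability-level results can be obtained by averaging per-instance scores within a subset and then across the subsets that share an ability. The sketch below illustrates that aggregation only; the JSON-lines layout, the field names (subset, ability, score), and the file name model_outputs.jsonl are assumptions for illustration, not the released VocalBench-zh evaluation code.

# Hypothetical sketch: aggregate per-instance scores into an ability-level profile.
# The results format and field names are assumed, not taken from the released repo.
import json
from collections import defaultdict

def ability_profile(results_path: str) -> dict:
    """Average instance scores per subset, then average subset means per ability."""
    subset_scores = defaultdict(list)
    with open(results_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # assumed: {"subset": ..., "ability": ..., "score": ...}
            subset_scores[(record["ability"], record["subset"])].append(record["score"])

    ability_scores = defaultdict(list)
    for (ability, _subset), scores in subset_scores.items():
        ability_scores[ability].append(sum(scores) / len(scores))

    # One number per ability: the mean of its subsets' mean scores.
    return {ability: sum(means) / len(means) for ability, means in ability_scores.items()}

if __name__ == "__main__":
    print(ability_profile("model_outputs.jsonl"))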
Similar Papers
VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
Sound
Tests how well AI understands spoken Chinese.
CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
Computation and Language
Helps computers understand mixed languages when talking.
MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Computation and Language
Tests AI to see if it's safe for doctors.