VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents
By: Jiliang Hu, Wenfu Wang, Zuchao Li, and more
Potential Business Impact:
Tests how well AI understands spoken Chinese.
Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) -- a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.
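The three evaluation perspectives can be pictured as a simple scoring pipeline over the benchmark's dimensions and subtasks. The sketch below is a hypothetical illustration only, not the authors' released code: the dimension names follow the abstract, while the sample format, model interface, and scoring function are assumptions.

# Hypothetical sketch of a VCB-Bench-style evaluation loop (not the official code).
# Dimension and subtask names follow the abstract; everything else is assumed.

from statistics import mean

DIMENSIONS = {
    "instruction_following": ["text_commands", "speech_level_control"],
    "knowledge_understanding": ["general_knowledge", "reasoning", "daily_dialogue"],
    "robustness": ["content_perturbation", "environment_perturbation", "speaker_traits"],
}

def score_response(response: str, reference: str) -> float:
    """Placeholder scorer; a real benchmark would use task-specific judges or metrics."""
    return 1.0 if reference.lower() in response.lower() else 0.0

def evaluate(model_fn, samples):
    """samples: list of dicts with 'dimension', 'audio_path', and 'reference' keys."""
    per_dim = {d: [] for d in DIMENSIONS}
    for s in samples:
        response = model_fn(s["audio_path"])  # the model answers from real human speech input
        per_dim[s["dimension"]].append(score_response(response, s["reference"]))
    # Report a mean score per dimension, mirroring the benchmark's fine-grained reporting.
    return {d: mean(v) if v else None for d, v in per_dim.items()}

Reporting one score per dimension rather than a single aggregate keeps the evaluation discriminative, which is the property the abstract emphasizes.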
Similar Papers
VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
Computation and Language
Tests how well computers understand spoken Chinese.
CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment
Artificial Intelligence
Helps computers find memory problems from talking.
SOVA-Bench: Benchmarking the Speech Conversation Ability for LLM-based Voice Assistant
Sound
Tests how well AI talks like a real person.