CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching
By: Heyang Liu, Yuhao Wang, Ziyang Cheng, and more
Potential Business Impact:
Helps computers understand mixed languages when talking.
The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find that existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data-construction and training approaches to improve language alignment, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves knowledge accuracy from 25.14% to 46.13%, raises the open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at https://huggingface.co/datasets/VocalNet/CS3-Bench.
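To make the abstract's headline numbers concrete, here is a minimal sketch of how the reported metrics can be interpreted. The helper function and the illustrative monolingual baseline score are assumptions (the standard definition of a relative drop), not taken from the paper; only the percentage figures come from the abstract.

```python
# Hedged sketch: "relative performance drop" in its standard definition,
# applied to the numbers reported in the CS3-Bench abstract. The function
# name and the 60% baseline below are illustrative assumptions.

def relative_drop(baseline: float, degraded: float) -> float:
    """Relative performance drop, as a fraction of the baseline score."""
    return (baseline - degraded) / baseline

# Reported gains from the proposed CoR + KH approach (from the abstract):
knowledge_acc_before = 25.14   # % accuracy on knowledge-intensive QA
knowledge_acc_after = 46.13
understanding_before = 64.5    # % open-ended understanding rate
understanding_after = 86.5

print(f"Knowledge accuracy gain: {knowledge_acc_after - knowledge_acc_before:.2f} points")
print(f"Understanding rate gain: {understanding_after - understanding_before:.1f} points")

# Illustration of the up-to-66% relative drop: a model scoring 60% on
# monolingual QA would have fallen to about 20.4% under code-switching.
print(f"Relative drop: {relative_drop(60.0, 20.4):.0%}")
```

Under this reading, the improvements are absolute percentage-point gains, while the 66% figure is relative to each model's own monolingual performance.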
Similar Papers
VocalBench-zh: Decomposing and Benchmarking the Speech Conversational Abilities in Mandarin Context
Computation and Language
Tests how well computers understand spoken Chinese.
CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition
Computation and Language
Helps computers understand people speaking two languages.
CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment
Artificial Intelligence
Helps computers find memory problems from talking.