MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios
By: Shuai Wang, Zhaokai Sun, Zhennan Lin, and more
Potential Business Impact:
Helps computers understand talking in noisy groups.
Spoken Language Understanding (SLU) has progressed from traditional single-task methods to large audio language model (LALM) solutions. Yet, most existing speech benchmarks focus on single-speaker or isolated tasks, overlooking the challenges posed by multi-speaker conversations that are common in real-world scenarios. We introduce MSU-Bench, a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. Our hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers. By evaluating state-of-the-art models on MSU-Bench, we demonstrate that as task complexity increases across the benchmark's tiers, all models exhibit a significant performance decline. We also observe a persistent capability gap between open-source models and closed-source commercial ones, particularly in multi-speaker interaction reasoning. These findings validate the effectiveness of MSU-Bench for assessing and advancing conversational understanding in realistic multi-speaker environments. Demos can be found in the supplementary material.
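The four progressive tiers named in the abstract can be modeled as a small ordered taxonomy. This is an illustrative sketch only, assuming nothing about the benchmark's actual data format; the class and function names here are hypothetical, with the tier descriptions taken verbatim from the abstract.

```python
# Hypothetical sketch of the MSU-Bench tier hierarchy described in the
# abstract; not the paper's actual schema.
from enum import IntEnum


class Tier(IntEnum):
    """Tiers ordered from basic perception to complex reasoning."""
    SINGLE_SPEAKER_STATIC = 1
    SINGLE_SPEAKER_DYNAMIC = 2
    MULTI_SPEAKER_BACKGROUND = 3
    MULTI_SPEAKER_INTERACTION = 4


# Descriptions copied from the abstract's enumeration of the four tiers.
TIER_DESCRIPTIONS = {
    Tier.SINGLE_SPEAKER_STATIC: "single-speaker static attribute understanding",
    Tier.SINGLE_SPEAKER_DYNAMIC: "single-speaker dynamic attribute understanding",
    Tier.MULTI_SPEAKER_BACKGROUND: "multi-speaker background understanding",
    Tier.MULTI_SPEAKER_INTERACTION: "multi-speaker interaction understanding",
}


def tiers_in_order():
    """Return the tier descriptions in increasing order of task complexity."""
    return [TIER_DESCRIPTIONS[t] for t in sorted(Tier)]
```

A harness built around such a taxonomy could, for example, report per-tier accuracy to expose the complexity-driven performance decline the authors observe.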
Similar Papers
MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Computation and Language
Helps computers understand emotions and meaning in speech.
M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
Computation and Language
Helps computers know who said what in talks.
MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
Computation and Language
Tests how well computers understand talking.