Score: 2

MSU-Bench: Towards Understanding the Conversational Multi-talker Scenarios

Published: August 11, 2025 | arXiv ID: 2508.08155v1

By: Shuai Wang, Zhaokai Sun, Zhennan Lin and more

Potential Business Impact:

Helps computers understand conversations in which several people are speaking.

Spoken Language Understanding (SLU) has progressed from traditional single-task methods to large audio language model (LALM) solutions. Yet, most existing speech benchmarks focus on single-speaker or isolated tasks, overlooking the challenges posed by multi-speaker conversations that are common in real-world scenarios. We introduce MSU-Bench, a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. Our hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers. By evaluating state-of-the-art models on MSU-Bench, we demonstrate that as task complexity increases across the benchmark's tiers, all models exhibit a significant performance decline. We also observe a persistent capability gap between open-source models and closed-source commercial ones, particularly in multi-speaker interaction reasoning. These findings validate the effectiveness of MSU-Bench for assessing and advancing conversational understanding in realistic multi-speaker environments. Demos can be found in the supplementary material.
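As a rough illustration of the tiered design described above, the four tiers can be modelled as an ordered enumeration, with accuracy reported per tier so that a decline across tiers becomes visible. This is only a sketch: the names `Tier`, `accuracy_by_tier`, and `toy_records` are hypothetical, the choice of accuracy as the metric is an assumption, and the records are placeholders rather than data or results from the paper.

```python
from collections import defaultdict
from enum import IntEnum


class Tier(IntEnum):
    """The four tiers named in the abstract, ordered by increasing complexity."""
    SINGLE_SPEAKER_STATIC = 1
    SINGLE_SPEAKER_DYNAMIC = 2
    MULTI_SPEAKER_BACKGROUND = 3
    MULTI_SPEAKER_INTERACTION = 4


def accuracy_by_tier(records):
    """Return {Tier: fraction correct} for an iterable of (tier, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tier, is_correct in records:
        totals[tier] += 1
        hits[tier] += int(is_correct)
    # Sort so the result runs from the simplest tier to the most complex one.
    return {tier: hits[tier] / totals[tier] for tier in sorted(totals)}


# Placeholder records purely for illustration; not results from the paper.
toy_records = [
    (Tier.SINGLE_SPEAKER_STATIC, True),
    (Tier.SINGLE_SPEAKER_STATIC, True),
    (Tier.MULTI_SPEAKER_INTERACTION, True),
    (Tier.MULTI_SPEAKER_INTERACTION, False),
]
per_tier = accuracy_by_tier(toy_records)
```

Keeping the tiers in an `IntEnum` preserves their ordering, so per-tier scores for different models can be compared side by side from basic perception up to multi-speaker interaction reasoning.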

Country of Origin
🇨🇳 China

Page Count
23 pages

Category
Electrical Engineering and Systems Science: Audio and Speech Processing