The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMs
By: Brandon James Carone, Iran R. Roman, Pablo Ripollés
Potential Business Impact:
Tests whether AI models can genuinely perceive and reason about musical structure.
Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio Flamingo 3) against a large human baseline (N=200). Our results reveal wide variance in SOTA capabilities and a persistent gap with human experts: while Gemini Pro succeeds at basic perception, Qwen2.5-Omni and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find that Chain-of-Thought (CoT) prompting yields inconsistent and often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and for driving the development of more robust AI systems.
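To make the evaluation protocol concrete, the sketch below shows how a multiple-choice MUSE-style task might be scored with a direct prompt versus a CoT prompt, so that per-task accuracies can be compared against the 25% chance level. This is a minimal illustration, not the benchmark's actual harness: `accuracy`, `RandomBaseline`, the prompt templates, and the item schema are all hypothetical placeholders.

```python
import random

DIRECT_PROMPT = ("Listen to the clip and answer: {question}\n"
                 "Options: {options}\nAnswer with the option letter only.")
COT_PROMPT = ("Listen to the clip and answer: {question}\n"
              "Options: {options}\nThink step by step, then give the option letter.")

class RandomBaseline:
    """Stub 'model' that answers uniformly at random; it stands in for a
    real audio-LLM client, whose API we do not assume here."""
    def generate(self, audio, prompt):
        return random.choice("ABCD")

def accuracy(model, items, template):
    """Score a model on multiple-choice items. `model.generate` is assumed
    to take an audio path plus a text prompt and return a string."""
    correct = 0
    for item in items:  # each item: audio path, question, options, answer key
        prompt = template.format(question=item["question"],
                                 options=", ".join(item["options"]))
        reply = model.generate(audio=item["audio_path"], prompt=prompt)
        # Crude parsing: take the last option letter mentioned in the reply,
        # which tolerates CoT replies that end with the final answer.
        letters = [c for c in reply.upper() if c in "ABCD"]
        if letters and letters[-1] == item["answer"]:
            correct += 1
    return correct / len(items)

# Hypothetical item in the spirit of a melodic-contour task; chance for
# 4-option items is 0.25, the reference point for "at or near chance".
items = [{"audio_path": "clip_001.wav",
          "question": "Do the two melodies share the same contour?",
          "options": ["Yes", "No", "Only the first half", "Cannot tell"],
          "answer": "A"}]
print(accuracy(RandomBaseline(), items, DIRECT_PROMPT))
print(accuracy(RandomBaseline(), items, COT_PROMPT))
```

Running the same item set through both templates and differencing the two accuracies per task is the kind of comparison behind the abstract's finding that CoT prompting can be inconsistent or even detrimental.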
Similar Papers
Evaluating Multimodal Large Language Models on Core Music Perception Tasks
Sound
Computers can't truly hear music, only read notes.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
CV and Pattern Recognition
Helps computers understand who speaks in videos.
MMAU-Pro: A Challenging and Comprehensive Benchmark for Holistic Evaluation of Audio General Intelligence
Audio and Speech Processing
Tests AI's ability to understand sounds and music.