Evaluating Multimodal Large Language Models on Core Music Perception Tasks
By: Brandon James Carone, Iran R. Roman, Pablo Ripollés
Potential Business Impact:
Computers can't truly hear music, only read notes.
Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.
Similar Papers
The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMS
Artificial Intelligence
Tests AI's ability to understand music.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
CV and Pattern Recognition
Helps computers understand who speaks in videos.
Musical Score Understanding Benchmark: Evaluating Large Language Models' Comprehension of Complete Musical Scores
Sound
Helps computers understand music scores like a human.