See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Published: December 1, 2025 | arXiv ID: 2512.02231v1

By: Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, and more

Potential Business Impact:

Helps computers understand who speaks in videos.

Business Areas:

Speech Recognition, Data and Analytics, Software

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

CV and Pattern Recognition

Tests AI's ability to understand science videos.

9 Oct 2025

AV-EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Omni-modal LLMS with Audio-visual Cues

Multimedia

AI understands feelings better from voices and faces.

8 Oct 2025

When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?

CV and Pattern Recognition

Helps AI tell what sounds match what it sees.

13 Nov 2025

Page Count

24 pages