Score: 2

MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Published: October 20, 2025 | arXiv ID: 2510.17722v1

By: Yaning Pan, Zekun Wang, Qianqian Xie and more

Potential Business Impact:

Tests AI's ability to talk about videos.

Business Areas:

Video Chat, Information Technology, Internet Services, Messaging and Telecommunications

The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI's ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses six core competencies that focus on perceptivity and interactivity, encompassing 987 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

CV and Pattern Recognition

Tests AI's skill with many videos at once.

10 Nov 2025 2

91%

RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

CV and Pattern Recognition

Tests AI's ability to watch and understand videos.

4 May 2025 0

91%

Understanding and Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding

CV and Pattern Recognition

Tests AI that watches videos for safety.

14 Jun 2025 2


Country of Origin

🇨🇳 China

Repos / Data Links

github.com github.com

Page Count

28 pages
