Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation

Published: March 12, 2025 | arXiv ID: 2503.10691v2

By: Qiji Zhou, Yifan Gong, Guangsheng Bao, and more

Potential Business Impact:

Tests how well AI models reason about videos by asking "what if" questions broken into smaller sub-questions.

Business Areas:
Image Recognition, Data and Analytics, Software

Counterfactual reasoning is crucial for robust video understanding but remains underexplored in existing multimodal benchmarks. In this paper, we introduce COVER (COunterfactual VidEo Reasoning), a multidimensional multimodal benchmark that systematically evaluates multimodal large language models (MLLMs) across the abstract-concrete and perception-cognition dimensions. Unlike prior multimodal benchmarks, COVER decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis. Experiments on commercial and open-source models reveal a strong correlation between sub-question accuracy and counterfactual reasoning performance, highlighting the role of structured inference in video understanding. Furthermore, our results suggest a key insight: enhancing the reasoning capability of models is essential for improving the robustness of video understanding. COVER establishes a new standard for assessing MLLMs' logical reasoning abilities in dynamic environments. Our work is available at https://github.com/gongyifan-hash/COVER-Benchmark.
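The relationship the abstract describes, between sub-question accuracy and counterfactual accuracy, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the model names, the correctness flags, and the choice of Pearson correlation are hypothetical and are not taken from the paper or the COVER repository.

```python
from math import sqrt

def accuracy(flags):
    """Return the fraction of True values in a list of booleans."""
    return sum(flags) / len(flags)

def pearson(xs, ys):
    """Return the Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data, NOT results from the paper.
# "sub": correctness on sub-questions; "cf": correctness on counterfactual questions.
results = {
    "model_a": {"sub": [True, True, True, True], "cf": [True, True, True, False]},
    "model_b": {"sub": [True, True, False, False], "cf": [True, False, False, False]},
    "model_c": {"sub": [True, False, False, False], "cf": [True, False, False, False]},
}

# One accuracy value per model for each question type.
sub_acc = [accuracy(r["sub"]) for r in results.values()]
cf_acc = [accuracy(r["cf"]) for r in results.values()]

# Correlation across models between the two accuracy measures.
r = pearson(sub_acc, cf_acc)
print(f"sub-question accuracy: {sub_acc}")
print(f"counterfactual accuracy: {cf_acc}")
print(f"Pearson r across models: {r:.3f}")
```

With real benchmark output, the per-model flags would come from graded model answers, and a rank-based measure could be substituted if the relationship is not expected to be linear.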

Repos / Data Links
https://github.com/gongyifan-hash/COVER-Benchmark

Page Count
19 pages

Category
Computer Science:
CV and Pattern Recognition