Team of One: Cracking Complex Video QA with Model Synergy
By: Jun Xie, Zhaoran Zhao, Xiongjun Guan, and more
Potential Business Impact:
Helps computers answer hard questions about videos more accurately.
We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios, as benchmarked on the CVRR-ES dataset. Existing Video Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries. To address these challenges, we introduce a prompting-and-response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs) via structured chains of thought, each tailored to distinct reasoning pathways. An external Large Language Model (LLM) serves as an evaluator and integrator, selecting and fusing the most reliable responses. Extensive experiments demonstrate that our method significantly outperforms existing baselines across all evaluation metrics, showcasing superior generalization and robustness. Our approach offers a lightweight, extensible strategy for advancing multimodal reasoning without requiring model retraining, setting a strong foundation for future Video-LMM development.
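To make the coordination idea concrete, here is a minimal sketch of the prompting-and-response integration flow described above: several heterogeneous VLMs each answer the same video question along a distinct structured chain-of-thought prompt, and an external LLM acts as evaluator and integrator over the candidate answers. All names (the prompt templates, `collect_candidate_answers`, `integrate_answers`, the stub models) are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
# Sketch, assuming each VLM and the judge LLM are exposed as simple callables.
from typing import Callable, List

# Distinct reasoning pathways, each encoded as a structured chain-of-thought prompt.
COT_TEMPLATES = [
    "Describe the key events in temporal order, then answer: {question}",
    "List the objects and actors involved, reason about their interactions, then answer: {question}",
    "Consider plausible but incorrect interpretations first, rule them out, then answer: {question}",
]


def collect_candidate_answers(
    vlms: List[Callable[[str, str], str]],  # each VLM: (video_path, prompt) -> answer
    video_path: str,
    question: str,
) -> List[str]:
    """Query every VLM with every reasoning pathway and gather candidate answers."""
    candidates = []
    for vlm in vlms:
        for template in COT_TEMPLATES:
            prompt = template.format(question=question)
            candidates.append(vlm(video_path, prompt))
    return candidates


def integrate_answers(
    llm: Callable[[str], str],  # external LLM judge: prompt -> text
    question: str,
    candidates: List[str],
) -> str:
    """Ask the external LLM to evaluate the candidates and fuse them into one answer."""
    numbered = "\n".join(f"[{i}] {ans}" for i, ans in enumerate(candidates))
    judge_prompt = (
        "You are given a video question and several candidate answers produced by "
        "different models and reasoning styles.\n"
        f"Question: {question}\n"
        f"Candidates:\n{numbered}\n"
        "Select the most reliable candidates, reconcile any conflicts, and write a "
        "single consolidated answer."
    )
    return llm(judge_prompt)


if __name__ == "__main__":
    # Stub models so the sketch runs without any real VLM/LLM backend.
    dummy_vlm = lambda video, prompt: f"answer conditioned on: {prompt[:40]}..."
    dummy_llm = lambda prompt: "consolidated answer (stub)"

    answers = collect_candidate_answers([dummy_vlm, dummy_vlm], "clip.mp4", "What caused the fall?")
    print(integrate_answers(dummy_llm, "What caused the fall?", answers))
```

The key design point is that no model is retrained: reliability comes from diversity of prompts and models plus an external judge, which is why the abstract describes the approach as lightweight and extensible.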
Similar Papers
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?
CV and Pattern Recognition
Helps computers understand videos by thinking step-by-step.
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
CV and Pattern Recognition
Helps computers understand videos seen through a person's eyes.
CVBench: Evaluating Cross-Video Synergies for Complex Multimodal Understanding and Reasoning
CV and Pattern Recognition
Helps computers understand many videos together.