A Video Is Not Worth a Thousand Words
By: Sam Pollard, Michael Wray
Potential Business Impact:
Shows how much AI relies on text versus video when answering questions about videos.
As we become increasingly dependent on vision language models (VLMs) to answer questions about the world around us, a significant amount of research is devoted to increasing both the difficulty of video question answering (VQA) datasets and the context lengths of the models they evaluate. The reliance on large language models as backbones has led to concerns about potential text dominance, yet the exploration of interactions between modalities remains underdeveloped. Given the complexity that multi-modal models introduce, how do we measure whether we are heading in the right direction? We propose a joint method of computing both feature attributions and modality scores based on Shapley values, where both the features and the modalities are arbitrarily definable. Using these metrics, we compare 6 VLMs of varying context lengths on 4 representative datasets, focusing on multiple-choice VQA. In particular, we treat video frames and whole textual elements as equal features in the hierarchy, and the multiple-choice VQA task as an interaction between three modalities: video, question and answer. Our results demonstrate a dependence on text and show that multiple-choice VQA devolves into a model's ability to ignore distractors. Code is available at https://github.com/sjpollard/a-video-is-not-worth-a-thousand-words.
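The abstract does not spell out the computation, so the following is only a rough illustrative sketch of how Shapley-value feature attributions and per-modality scores could be estimated via permutation sampling, treating individual video frames and whole textual elements as equal features. The feature names, the toy value function, and the aggregation into modality scores are illustrative assumptions, not the authors' implementation; their actual code is in the repository linked above.

```python
import random
from collections import defaultdict

# Hypothetical feature set: each feature is tagged with the modality it belongs to.
# In the paper's setting these would be video frames and whole textual elements
# (question, answer options); here they are placeholders for illustration.
FEATURES = (
    [(f"frame_{i}", "video") for i in range(8)]
    + [("question", "question")]
    + [(f"answer_{c}", "answer") for c in "ABCD"]
)


def value(subset):
    """Stand-in value function v(S) (assumption, not the paper's definition).

    In practice this would be the VLM's score (e.g. log-probability of the
    correct choice) when only the features in `subset` are presented and the
    rest are masked out. Here it is a toy function so the sketch runs.
    """
    names = {name for name, _ in subset}
    score = 0.0
    if "question" in names:
        score += 1.0
    score += 0.1 * sum(name.startswith("frame_") for name in names)
    if "answer_B" in names:  # pretend B is the correct option
        score += 2.0
    return score


def shapley_attributions(features, value_fn, n_permutations=200, seed=0):
    """Monte Carlo (permutation-sampling) estimate of Shapley values.

    For each sampled permutation, a feature's marginal contribution is the
    change in v(S) when it is added to the features that precede it; averaging
    over permutations gives an unbiased Shapley estimate.
    """
    rng = random.Random(seed)
    phi = defaultdict(float)
    for _ in range(n_permutations):
        order = features[:]
        rng.shuffle(order)
        prefix, prev = [], value_fn([])
        for feat in order:
            prefix.append(feat)
            cur = value_fn(prefix)
            phi[feat] += (cur - prev) / n_permutations
            prev = cur
    return dict(phi)


def modality_scores(phi):
    """Aggregate per-feature attributions into per-modality scores."""
    scores = defaultdict(float)
    for (name, modality), contribution in phi.items():
        scores[modality] += contribution
    return dict(scores)


if __name__ == "__main__":
    phi = shapley_attributions(FEATURES, value)
    print("modality scores:", modality_scores(phi))
```

Under this toy value function the text-side features dominate the modality scores, which mirrors the kind of text dependence the paper measures, but the real analysis depends on the actual model scores and ablation scheme described in the repository.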
Similar Papers
FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
CV and Pattern Recognition
Helps computers understand money videos by watching and listening.
Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
CV and Pattern Recognition
Makes computers judge video quality better, faster.
An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
Information Retrieval
Helps computers understand videos better and faster.