An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
By: Zhi Li, Yanan Wang, Hao Niu, and more
Potential Business Impact:
Helps computers understand videos better and faster.
Multimodal large language models (MLLMs) have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single-modality inputs (question only, subtitles, visual frames, and audio signals) as well as multimodal combinations, on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially enhance accuracy but impose heavy costs in GPU memory and inference latency, while subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and provide practical insights for designing resource-aware MLLM-based VideoQA systems.
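To make the kind of evaluation the abstract describes concrete, below is a minimal sketch of a modality-ablation harness: it feeds an MLLM different subsets of the available modalities (question only, subtitles, frames, frames plus subtitles) and records accuracy, per-item latency, and peak GPU memory. The `answer_fn` interface, the `VideoQAItem` fields, and the modality configurations are illustrative assumptions, not the paper's code; only the `torch.cuda` memory calls and `time.perf_counter` are standard APIs.

```python
# Hypothetical modality-ablation harness; `answer_fn` stands in for any
# MLLM VideoQA interface, and the benchmark items are illustrative.
import time
from dataclasses import dataclass
from typing import Callable, Optional

try:
    import torch
    CUDA = torch.cuda.is_available()
except ImportError:  # harness still runs without torch; memory stats are skipped
    CUDA = False


@dataclass
class VideoQAItem:
    question: str
    subtitles: Optional[str] = None   # lightweight textual modality
    frames: Optional[list] = None     # e.g. sampled visual frames
    audio: Optional[object] = None    # raw or encoded audio signal
    answer: str = ""


# Single-modality inputs and one multimodal combination, mirroring the study's setup.
MODALITY_CONFIGS = {
    "question_only": [],
    "subtitles":     ["subtitles"],
    "frames":        ["frames"],
    "frames+subs":   ["frames", "subtitles"],
}


def run_ablation(answer_fn: Callable[..., str], items: list) -> None:
    """Measure accuracy, latency, and (if CUDA) peak memory per modality config."""
    for name, fields in MODALITY_CONFIGS.items():
        if CUDA:
            torch.cuda.reset_peak_memory_stats()
        correct, start = 0, time.perf_counter()
        for item in items:
            inputs = {f: getattr(item, f) for f in fields if getattr(item, f) is not None}
            pred = answer_fn(item.question, **inputs)
            correct += int(pred.strip().lower() == item.answer.strip().lower())
        latency_ms = 1000 * (time.perf_counter() - start) / max(len(items), 1)
        mem_gb = torch.cuda.max_memory_allocated() / 1e9 if CUDA else float("nan")
        print(f"{name:14s} acc={correct / max(len(items), 1):.2f} "
              f"latency={latency_ms:.1f} ms/item peak_mem={mem_gb:.2f} GB")


if __name__ == "__main__":
    # Dummy model that returns a fixed answer, just to show the harness shape.
    dummy = lambda question, **modalities: "yes"
    run_ablation(dummy, [VideoQAItem(question="Is the door open?", answer="yes")])
```

In a real run, `answer_fn` would wrap an MLLM inference call, and the per-configuration latency and memory numbers would expose the trade-off the paper reports: frames raise accuracy at a steep resource cost, while subtitles are cheap and competitive on long videos.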
Similar Papers
FinCap: Topic-Aligned Captions for Short-Form Financial YouTube Videos
CV and Pattern Recognition
Helps computers understand money videos by watching and listening.
An Empirical Study on How Video-LLMs Answer Video Questions
CV and Pattern Recognition
Explains how AI understands videos to make them faster.
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Information Retrieval
Helps video apps understand what you *really* like.