An Empirical Study for Representations of Videos in Video Question Answering via MLLMs

Published: October 14, 2025 | arXiv ID: 2510.12299v1

By: Zhi Li, Yanan Wang, Hao Niu, and more

Potential Business Impact:

Helps computers understand videos better and faster.

Business Areas:
Video Media and Entertainment, Video

Multimodal large language models (MLLMs) have recently achieved remarkable progress in video question answering (VideoQA) by jointly processing visual, textual, and audio information. However, it remains unclear which video representations are most effective for MLLMs, and how different modalities balance task accuracy against computational efficiency. In this work, we present a comprehensive empirical study of video representation methods for VideoQA with MLLMs. We systematically evaluate single-modality inputs (question only, subtitles, visual frames, and audio signals) as well as multimodal combinations, on two widely used benchmarks: VideoMME and LongVideoBench. Our results show that visual frames substantially enhance accuracy but impose heavy costs in GPU memory and inference latency, while subtitles provide a lightweight yet effective alternative, particularly for long videos. These findings highlight clear trade-offs between effectiveness and efficiency and provide practical insights for designing resource-aware MLLM-based VideoQA systems.
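The accuracy-versus-efficiency comparison described in the abstract can be reproduced with a simple harness that toggles each modality on or off per query and records latency and peak GPU memory alongside accuracy. The sketch below is illustrative only: the `run_mllm` wrapper, the sample fields, and the exact-match scoring are assumptions, not the authors' evaluation code.

```python
# Minimal sketch of a per-modality VideoQA evaluation loop (assumed setup,
# not the paper's actual code). run_mllm is a placeholder for the MLLM call.
import time
import torch

def run_mllm(question, frames=None, subtitles=None, audio=None):
    """Hypothetical wrapper around an MLLM's generation call.
    Replace with the actual model inference for your setup."""
    # ... model-specific prompt assembly and generation would go here ...
    return "placeholder answer"

def evaluate_configuration(samples, use_frames=False, use_subtitles=False, use_audio=False):
    """Run one input configuration and report accuracy plus efficiency stats."""
    correct, latencies = 0, []
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
    for sample in samples:
        start = time.perf_counter()
        prediction = run_mllm(
            sample["question"],
            frames=sample["frames"] if use_frames else None,
            subtitles=sample["subtitles"] if use_subtitles else None,
            audio=sample["audio"] if use_audio else None,
        )
        latencies.append(time.perf_counter() - start)
        correct += int(prediction == sample["answer"])
    peak_mem_gb = (
        torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
    )
    return {
        "accuracy": correct / len(samples),
        "avg_latency_s": sum(latencies) / len(latencies),
        "peak_gpu_mem_gb": peak_mem_gb,
    }
```

Running `evaluate_configuration` once per input combination (question only, subtitles, frames, and their mixtures) yields the kind of effectiveness-versus-cost table the study reports, with subtitles expected to add far less latency and memory than dense visual frames.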

Page Count
6 pages

Category
Computer Science:
Information Retrieval