Improving Video Question Answering through query-based frame selection
By: Himanshu Patil, Geo Jolly, Ramana Raja Buddala and more
Potential Business Impact:
Helps computers understand videos better by picking important scenes.
Video Question Answering (VideoQA) models enhance understanding of and interaction with audiovisual content, making it more accessible, searchable, and useful in fields such as education, surveillance, entertainment, and content creation. Because of heavy compute requirements, most large visual language models (VLMs) for VideoQA process a fixed number of frames obtained by uniformly sampling the video. However, uniform sampling neither picks out the important frames nor captures the context of the video. We present a novel query-based method that selects frames relevant to the question using submodular mutual information (SMI) functions. By replacing uniform frame sampling with query-based selection, our method ensures that the chosen frames provide complementary and essential visual information for accurate VideoQA. We evaluate our approach on the MVBench dataset, which spans a diverse set of multi-action video tasks. VideoQA accuracy on this dataset was assessed using two VLMs, Video-LLaVA and LLaVA-NeXT, both of which originally employed uniform frame sampling. Experiments were conducted using both uniform and query-based sampling strategies. An accuracy improvement of up to 4% was observed when using query-based frame selection instead of uniform sampling. Qualitative analysis further shows that query-based selection using SMI functions consistently picks frames better aligned with the question. We opine that such query-based frame selection can enhance accuracy in a wide range of tasks that rely on only a subset of video frames.
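To make the contrast between uniform sampling and query-based selection concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state which SMI functions were used, so the sketch assumes one facility-location-style SMI objective from the submodular optimization literature, I(A; Q) = sum over frames i of min(max over selected j of s(i, j), eta * s(i, Q)), maximized greedily. The function names, the `eta` parameter, and the use of cosine similarity over precomputed frame and question embeddings are all illustrative assumptions.

```python
import numpy as np


def uniform_indices(num_frames, k):
    """Baseline: k evenly spaced frame indices, ignoring the question."""
    return np.linspace(0, num_frames - 1, k).round().astype(int).tolist()


def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def smi_value(selected, frame_sim, query_sim, eta=1.0):
    """Assumed facility-location-style SMI objective (hypothetical choice).

    Each frame i contributes min(best similarity to a selected frame,
    eta * similarity to the query), so a frame only adds value if it is
    both covered by the selection and relevant to the question.
    """
    if not selected:
        return 0.0
    cover = frame_sim[:, selected].max(axis=1)
    return float(np.minimum(cover, eta * query_sim).sum())


def select_frames(frame_emb, query_emb, k, eta=1.0):
    """Greedily pick k frame indices that maximize the assumed SMI objective.

    frame_emb: (num_frames, d) array of frame embeddings.
    query_emb: (num_queries, d) array of question embeddings.
    """
    # Negative similarities are clipped to zero so that all terms are non-negative.
    frame_sim = np.clip(cosine_sim(frame_emb, frame_emb), 0.0, None)
    query_sim = np.clip(cosine_sim(frame_emb, query_emb), 0.0, None).max(axis=1)

    selected = []
    for _ in range(min(k, len(frame_emb))):
        current = smi_value(selected, frame_sim, query_sim, eta)
        best, best_gain = None, -1.0
        for i in range(len(frame_emb)):
            if i in selected:
                continue
            gain = smi_value(selected + [i], frame_sim, query_sim, eta) - current
            if gain > best_gain:  # strict comparison: ties go to the earliest frame
                best, best_gain = i, gain
        selected.append(best)
    return sorted(selected)  # restore temporal order before passing frames to a VLM
```

In practice the embeddings would come from an image-text encoder applied to the video frames and the question; that step is left out here because the abstract does not specify it. Because a redundant frame adds little marginal gain once a similar frame has been chosen, the greedy loop tends to return frames that are both relevant to the question and complementary to one another, which plain top-k ranking by similarity does not guarantee.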
Similar Papers
HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
CV and Pattern Recognition
Finds the most important moments in videos.
A.I.R.: Enabling Adaptive, Iterative, and Reasoning-based Frame Selection For Video Question Answering
CV and Pattern Recognition
Helps AI understand videos by picking key moments.
POVQA: Preference-Optimized Video Question Answering with Rationales for Data Efficiency
CV and Pattern Recognition
Lets computers understand long videos by summarizing them.