Towards Fine-Grained Video Question Answering
By: Wei Dai, Alan Luo, Zane Durante, and more
Potential Business Impact:
Helps computers understand videos by tracking objects and actions.
In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which in turn limit the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground-truth scene graphs and temporal interval annotations, MOMA-QA is well suited to developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which combines a scene graph predictor, an efficient frame retriever, and a pre-trained large language model to perform temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.
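To make the architecture concrete, below is a minimal PyTorch sketch of how the three components named in the abstract (a frame retriever for temporal localization, a scene graph predictor for relationship reasoning, and a language model for answer generation) could fit together. All class names, tensor shapes, and the fusion step are hypothetical illustrations, not the paper's actual implementation; the abstract does not specify these details, and a real system would use a pre-trained LLM rather than the linear stand-in used here.

```python
import torch
import torch.nn as nn

class FrameRetriever(nn.Module):
    """Scores each frame against the question embedding (hypothetical design)."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T, D), question_feat: (D,) -> per-frame relevance weights (T,)
        scores = self.proj(frame_feats) @ question_feat
        return scores.softmax(dim=0)

class SceneGraphPredictor(nn.Module):
    """Predicts relationship logits for every ordered entity pair (hypothetical)."""
    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.rel_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_relations)
        )

    def forward(self, entity_feats: torch.Tensor) -> torch.Tensor:
        # entity_feats: (N, D) -> relation logits (N, N, R)
        n = entity_feats.size(0)
        subj = entity_feats.unsqueeze(1).expand(n, n, -1)
        obj = entity_feats.unsqueeze(0).expand(n, n, -1)
        return self.rel_head(torch.cat([subj, obj], dim=-1))

class SGVLMSketch(nn.Module):
    """Toy end-to-end flow: retrieve question-relevant frames, predict a scene
    graph over detected entities, and feed pooled features to a stubbed LM head."""
    def __init__(self, dim: int = 256, num_relations: int = 10, top_k: int = 4):
        super().__init__()
        self.retriever = FrameRetriever(dim)
        self.sg_predictor = SceneGraphPredictor(dim, num_relations)
        self.top_k = top_k
        # Stand-in for a pre-trained large language model's answer head.
        self.lm_head = nn.Linear(dim, 1000)

    def forward(self, frame_feats, entity_feats, question_feat):
        weights = self.retriever(frame_feats, question_feat)       # (T,)
        top_idx = weights.topk(self.top_k).indices                 # temporal localization
        rel_logits = self.sg_predictor(entity_feats)               # (N, N, R)
        # Naive fusion of retrieved-frame features with the question.
        pooled = frame_feats[top_idx].mean(dim=0) + question_feat
        return self.lm_head(pooled), top_idx, rel_logits

# Usage on random features standing in for a video and its detected entities.
model = SGVLMSketch()
frames = torch.randn(32, 256)    # 32 frames, 256-d features
entities = torch.randn(5, 256)   # 5 detected actors/objects
question = torch.randn(256)
answer_logits, frames_used, relations = model(frames, entities, question)
print(answer_logits.shape, frames_used.shape, relations.shape)
```

The separation of retrieval from relationship prediction mirrors the abstract's claim: the retriever narrows the video to the temporally relevant frames, while the scene graph predictor supplies the fine-grained spatial relations that entity-centric questions require.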
Similar Papers
VideoMultiAgents: A Multi-Agent Framework for Video Question Answering
CV and Pattern Recognition
Helps computers understand videos by thinking like a team.
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
CV and Pattern Recognition
Helps computers understand videos with text.
Advancing Egocentric Video Question Answering with Multimodal Large Language Models
CV and Pattern Recognition
Helps computers understand videos from a person's eyes.