Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models
By: Jinhwan Seo, Yoonki Cho, Junhyug Noh, and more
Potential Business Impact:
Helps computers understand videos by finding key moments.
In this technical report, we introduce a framework to address the Grounded Video Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The GVQA task demands robust multimodal models capable of complex reasoning over video content, grounding the resulting answers visually, and tracking the referenced objects temporally. To achieve this capability, our proposed approach decomposes the GVQA task into a three-stage pipeline: (1) Video Reasoning & QA, (2) Spatio-temporal Grounding, and (3) Tracking. Our key contribution is the introduction of a trigger moment, derived from our proposed CORTEX prompt, which pinpoints the single most visible frame of a target object to serve as a robust anchor for grounding and tracking. With this approach, we achieve a HOTA score of 0.4968, a significant improvement over the previous year's winning score of 0.2704 on the GVQA task.
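As a rough illustration of how the three-stage decomposition might be wired together, here is a minimal Python sketch. The function names (`answer_question`, `find_trigger_moment`, `track_from_anchor`) and the `TriggerMoment` structure are hypothetical stand-ins for illustration, not the authors' actual interfaces, and the stubbed outputs are placeholders rather than real model calls.

```python
from dataclasses import dataclass

# Hypothetical structure for the trigger moment: the single frame where
# the target object is most visible, used as an anchor for tracking.
@dataclass
class TriggerMoment:
    frame_index: int                  # index of the most visible frame
    bbox: tuple[int, int, int, int]   # (x1, y1, x2, y2) box of the target

def answer_question(video_frames, question):
    """Stage 1: Video Reasoning & QA (placeholder for an MLLM call).

    In practice, a multimodal LLM reasons over sampled frames and the
    question to produce an answer naming the target object.
    """
    return "the red mug"  # stub answer

def find_trigger_moment(video_frames, target_object):
    """Stage 2: Spatio-temporal Grounding via a CORTEX-style prompt.

    A real system would prompt the MLLM to pinpoint the frame where the
    target object is most visible and localize it; this is a fixed stub.
    """
    return TriggerMoment(frame_index=42, bbox=(100, 80, 220, 260))

def track_from_anchor(video_frames, anchor: TriggerMoment):
    """Stage 3: Tracking, propagating the anchor box through the video.

    A real implementation would run a video tracker forward and backward
    from the anchor frame; this stub simply copies the anchor box.
    """
    return {i: anchor.bbox for i in range(len(video_frames))}

def grounded_video_qa(video_frames, question):
    answer = answer_question(video_frames, question)    # (1) QA
    anchor = find_trigger_moment(video_frames, answer)  # (2) grounding
    track = track_from_anchor(video_frames, anchor)     # (3) tracking
    return answer, track

if __name__ == "__main__":
    frames = [None] * 100  # placeholder for decoded video frames
    answer, track = grounded_video_qa(frames, "What did the person pick up?")
    print(answer, len(track))
```

The design point the sketch tries to capture is that stages (2) and (3) both key off the single trigger-moment frame: grounding happens once at the most visible frame, and tracking then propagates that anchor across the rest of the video.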
Similar Papers
Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task
CV and Pattern Recognition
Helps computers understand videos by seeing and thinking.
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
CV and Pattern Recognition
Helps computers find the right video moments to answer questions.
Moment Quantization for Video Temporal Grounding
CV and Pattern Recognition
Finds the right video clips for a description.