ArrowGEV: Grounding Events in Video via Learning the Arrow of Time
By: Fangxu Yu , Ziyao Lu , Liqiang Niu and more
Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
Similar Papers
Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding
CV and Pattern Recognition
Finds video moments from text, even without seeing the future.
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding
CV and Pattern Recognition
Finds video moments described by words.
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding
CV and Pattern Recognition
Finds exact moments in videos using words.