ArrowGEV: Grounding Events in Video via Learning the Arrow of Time

Published: January 10, 2026 | arXiv ID: 2601.06559v1

By: Fangxu Yu, Ziyao Lu, Liqiang Niu and more

Grounding events in videos serves as a fundamental capability in video analysis. While Vision-Language Models (VLMs) are increasingly employed for this task, existing approaches predominantly train models to associate events with timestamps in the forward video only. This paradigm hinders VLMs from capturing the inherent temporal structure and directionality of events, thereby limiting robustness and generalization. To address this limitation, inspired by the arrow of time in physics, which characterizes the intrinsic directionality of temporal processes, we propose ArrowGEV, a reinforcement learning framework that explicitly models temporal directionality in events to improve both event grounding and temporal directionality understanding in VLMs. Specifically, we categorize events into time-sensitive (e.g., putting down a bag) and time-insensitive (e.g., holding a towel in the left hand). The former denote events whose reversal substantially alters their meaning, while the latter remain semantically unchanged under reversal. For time-sensitive events, ArrowGEV introduces a reward that encourages VLMs to discriminate between forward and backward videos, whereas for time-insensitive events, it enforces consistent grounding across both directions. Extensive experiments demonstrate that ArrowGEV not only improves grounding precision and temporal directionality recognition, but also enhances general video understanding and reasoning ability.
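The abstract describes the two reward behaviors only in words, so the following is a minimal sketch of how a direction-aware reward of that kind might be computed; it is not the authors' formulation. Everything in it is an assumption made for illustration: the function names (`temporal_iou`, `mirror_span`, `direction_aware_reward`), the use of temporal IoU as the grounding score, the equal 0.5 weighting, and the convention that a prediction of `None` on the reversed video means the model reports the event as absent.

```python
from typing import Optional, Tuple

Span = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(a: Span, b: Span) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def mirror_span(span: Span, duration: float) -> Span:
    """Map a span in the forward video to where it lies in the reversed video."""
    start, end = span
    return (duration - end, duration - start)


def direction_aware_reward(
    pred_fwd: Span,
    pred_bwd: Optional[Span],
    gt: Span,
    duration: float,
    time_sensitive: bool,
) -> float:
    """Hypothetical reward combining forward grounding with a direction term.

    Assumption: for time-insensitive events the prediction on the reversed
    video should match the mirrored ground-truth span; for time-sensitive
    events it should NOT, because the reversed clip shows a different event.
    """
    fwd_score = temporal_iou(pred_fwd, gt)
    # Overlap of the backward prediction with the mirrored ground truth;
    # None is treated as "event not present", i.e. zero overlap.
    if pred_bwd is None:
        bwd_overlap = 0.0
    else:
        bwd_overlap = temporal_iou(pred_bwd, mirror_span(gt, duration))

    if time_sensitive:
        # Reward discriminating forward from backward: low overlap is good.
        direction_term = 1.0 - bwd_overlap
    else:
        # Reward consistent grounding across both directions.
        direction_term = bwd_overlap
    return 0.5 * (fwd_score + direction_term)
```

Under these assumptions, a time-insensitive event such as "holding a towel in the left hand" earns the full reward when the model grounds it at the mirrored location in the reversed video, while a time-sensitive event such as "putting down a bag" earns the full reward only when the model declines to ground it there.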

Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding

CV and Pattern Recognition

Finds video moments from text, even without seeing the future.

6 Aug 2025 3

TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

CV and Pattern Recognition

Finds video moments described by words.

3 Aug 2025 1

TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding

CV and Pattern Recognition

Finds exact moments in videos using words.

11 Aug 2025 1
