EventSTU: Event-Guided Efficient Spatio-Temporal Understanding for Video Large Language Models
By: Wenhao Xu, Xin Dong, Yue Li, and more
Potential Business Impact:
Makes video AI faster by skipping parts of a video where nothing changes.
Video large language models have demonstrated strong video understanding capabilities but suffer from high inference costs due to the massive number of tokens in long videos. Inspired by event-based vision, we propose an event-guided, training-free framework for efficient spatio-temporal understanding, named EventSTU. In the temporal domain, we design a coarse-to-fine keyframe sampling algorithm that exploits the change-triggered property of event cameras to eliminate redundant frames. In the spatial domain, we design an adaptive token pruning algorithm that leverages the visual saliency of events as a zero-cost prior to guide spatial reduction. From a holistic spatio-temporal perspective, we further integrate question relevance from keyframe sampling to adaptively allocate token pruning budgets. To facilitate evaluation, we construct EventBench, the first event-inclusive, human-annotated multimodal benchmark that covers diverse real-world scenarios. Beyond physical event cameras, EventSTU also supports general video understanding using simulated events. Comprehensive experiments show that EventSTU achieves 3.01x FLOPs reduction and 3.10x prefilling speedup over the strongest baseline while still improving performance.
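The abstract does not spell out the algorithms, but its three ingredients (change-triggered keyframe sampling, event-saliency token pruning, relevance-weighted budget allocation) can be illustrated. Below is a minimal NumPy sketch under stated assumptions: the function names (simulate_events, sample_keyframes, allocate_budgets, prune_tokens), the log-intensity event simulation, the bin-then-argmax coarse-to-fine pass, and all thresholds are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def simulate_events(frames, threshold=0.1):
    """Binary per-pixel event map: log-intensity change above a contrast
    threshold, mimicking how an event camera fires on brightness changes."""
    gray = frames.mean(axis=-1)                       # (T, H, W), values in [0, 1]
    diff = np.abs(np.diff(np.log1p(gray), axis=0))    # frame-to-frame log change
    events = (diff > threshold).astype(np.float32)    # (T-1, H, W)
    return np.concatenate([np.zeros_like(events[:1]), events])  # pad back to T

def sample_keyframes(events, num_keyframes=8):
    """Coarse: split the video into equal temporal bins.
    Fine: keep the frame with the most event activity inside each bin."""
    activity = events.reshape(len(events), -1).sum(axis=1)
    bins = np.array_split(np.arange(len(activity)), num_keyframes)
    return np.array([b[np.argmax(activity[b])] for b in bins])

def allocate_budgets(relevance, total_budget=0.25, floor=0.05):
    """Spread a global token budget over keyframes in proportion to their
    question-relevance scores (the scores are assumed given here)."""
    w = np.asarray(relevance, dtype=np.float64)
    w = w / w.sum()
    return np.clip(w * len(w) * total_budget, floor, 1.0)

def prune_tokens(event_map, patch=16, budget=0.25):
    """Keep only the top-`budget` fraction of patch tokens, ranked by how many
    events (a zero-cost saliency prior) fall inside each patch."""
    H, W = event_map.shape
    hp, wp = H // patch, W // patch
    saliency = event_map[: hp * patch, : wp * patch]
    saliency = saliency.reshape(hp, patch, wp, patch).sum(axis=(1, 3)).ravel()
    k = max(1, int(budget * saliency.size))
    return np.sort(np.argsort(saliency)[-k:])         # indices of kept patches

# Toy end-to-end run on random frames; real relevance scores would come from
# question-frame similarity inside the video LLM.
frames = np.random.rand(64, 224, 224, 3)              # 64 frames, 224x224 RGB
ev = simulate_events(frames)
keys = sample_keyframes(ev, num_keyframes=8)
budgets = allocate_budgets(np.random.rand(len(keys)))
kept = [prune_tokens(ev[f], budget=b) for f, b in zip(keys, budgets)]
```

The binning step keeps temporal coverage while the per-bin argmax keeps the change-triggered flavor; in the paper's pipeline, question relevance computed during keyframe sampling is reused to set the per-frame pruning budgets, which this sketch approximates with externally supplied scores.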
Similar Papers
Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
Multimedia
Helps apps understand traffic scenes better.
LET-US: Long Event-Text Understanding of Scenes
CV and Pattern Recognition
Lets computers understand long videos of light changes.
R-AVST: Empowering Video-LLMs with Fine-Grained Spatio-Temporal Reasoning in Complex Audio-Visual Scenarios
CV and Pattern Recognition
Helps computers understand videos with sound and movement.