Score: 1

T*: Re-thinking Temporal Search for Long-Form Video Understanding

Published: April 3, 2025 | arXiv ID: 2504.02259v3

By: Jinhui Ye, Zihan Wang, Haosen Sun, and more

Potential Business Impact:

Helps computers understand long videos faster by finding only the few frames that matter for a query.

Business Areas:
Semantic Search, Internet Services

Efficiently understanding long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and address a fundamental issue shared by all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are twofold. First, we frame temporal search as a Long Video Haystack problem: finding a minimal set of relevant frames (e.g., one to five) from tens of thousands based on a specific query. Building on this formulation, we introduce LV-Haystack, the first dataset of its kind, with 480 hours of video and 15,092 human-annotated instances for both training and evaluation, aimed at improving temporal search quality and efficiency. Results on LV-Haystack highlight a significant research gap in temporal search capabilities, with current SOTA search methods achieving only a 2.1% temporal F1 score on the LongVideoBench subset. Next, inspired by visual search in images, we propose T*, a lightweight temporal search framework that reframes costly temporal search as spatial search. T* leverages powerful visual localization techniques commonly used on images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Extensive experiments show that integrating T* with existing methods significantly improves SOTA long-form video understanding. Under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on the LongVideoBench XL subset. Our code, benchmark, and models are provided in the supplementary material.
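
The coarse-to-fine "zooming-in" idea described in the abstract can be illustrated with a simple search loop: sample a sparse grid of frames, score each sampled frame against the query with a frame-level relevance scorer, then re-sample the most promising temporal windows at finer granularity until the frame budget is reached. The sketch below is a minimal, hypothetical illustration of that pattern, not the authors' T* implementation; the `score_frame` callback, the grid size, and the zoom policy are assumptions made for demonstration only.

```python
# Illustrative coarse-to-fine temporal search -- a sketch of the "zoom in" idea,
# not the authors' T* implementation.
from typing import Callable, List, Tuple
import heapq


def temporal_zoom_search(
    num_frames: int,                      # total frames in the video
    score_frame: Callable[[int], float],  # hypothetical scorer: frame index -> relevance
    budget: int = 32,                     # max frames the downstream VLM will consume
    grid: int = 8,                        # frames sampled per window at each zoom level
    top_windows: int = 2,                 # windows to zoom into at the next level
) -> List[int]:
    """Return up to `budget` frame indices likely relevant to the query."""
    scored: List[Tuple[float, int]] = []
    windows: List[Tuple[int, int]] = [(0, num_frames)]  # temporal windows [start, end)

    while windows and len(scored) < budget:
        next_windows: List[Tuple[float, Tuple[int, int]]] = []
        for start, end in windows:
            step = max(1, (end - start) // grid)
            samples = list(range(start, end, step))[:grid]
            sample_scores = [score_frame(i) for i in samples]
            scored.extend(zip(sample_scores, samples))
            # Each sampled frame anchors a sub-window reaching to the next sample;
            # keep only sub-windows that are still wider than a single frame.
            for i, (s, idx) in enumerate(zip(sample_scores, samples)):
                sub_end = samples[i + 1] if i + 1 < len(samples) else end
                if sub_end - idx > 1:
                    next_windows.append((s, (idx, sub_end)))
        # Zoom into the highest-scoring sub-windows only.
        windows = [w for _, w in heapq.nlargest(top_windows, next_windows,
                                                key=lambda x: x[0])]

    # Keep the `budget` highest-scoring distinct frames, returned in temporal order.
    best = heapq.nlargest(budget, set(scored))
    return sorted(idx for _, idx in best)


if __name__ == "__main__":
    def relevance(i: int) -> float:
        # Toy scorer: relevance peaks at frame 7_000 and decays with temporal distance.
        return max(0.0, 1.0 - abs(i - 7_000) / 20_000)

    print(temporal_zoom_search(num_frames=100_000, score_frame=relevance))
```

In practice the toy `relevance` function would be replaced by an image-level localizer or VLM confidence score, which is what makes the per-frame scoring cheap relative to feeding all frames to a long-context model.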

Page Count
22 pages

Category
Computer Science:
CV and Pattern Recognition