Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding
By: Hong Gao, Yiming Bao, Xuezhen Tu and more
Potential Business Impact:
Helps computers understand videos like people do.
Video understanding requires not only visual recognition but also complex reasoning. Vision-Language Models (VLMs) demonstrate impressive capabilities, but they typically process videos in a largely single-pass manner, with limited support for revisiting evidence and refining answers iteratively. Recently emerging agent-based methods enable long-horizon reasoning, yet they either depend heavily on expensive proprietary models or require extensive agentic RL training. To overcome these limitations, we propose Agentic Video Intelligence (AVI), a flexible and training-free framework that mirrors human video comprehension through system-level design and optimization. AVI introduces three key innovations: (1) a human-inspired three-phase reasoning process (Retrieve-Perceive-Review) that ensures both sufficient global exploration and focused local analysis; (2) a structured video knowledge base organized through entity graphs, along with multi-granularity integrated tools, which together constitute the agent's interaction environment; and (3) an open-source model ensemble combining reasoning LLMs with lightweight base CV models and a VLM, eliminating dependence on proprietary APIs or RL training. Experiments on LVBench, VideoMME-Long, LongVideoBench, and Charades-STA demonstrate that AVI achieves competitive performance while offering superior interpretability.
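To make the Retrieve-Perceive-Review idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the `Segment` and `KnowledgeBase` classes, the function names, the keyword-matching "perception" step, and the fallback of widening the search to every entity are all assumptions chosen for illustration. In the framework described above, these roles would presumably be filled by an entity-graph knowledge base, CV models, a VLM, and a reasoning LLM.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical knowledge-base entry: a captioned clip with tagged entities."""
    start: float
    end: float
    caption: str
    entities: set[str]


@dataclass
class KnowledgeBase:
    """Toy stand-in for a structured video knowledge base (assumed design)."""
    segments: list[Segment] = field(default_factory=list)

    def entity_index(self) -> dict[str, list[int]]:
        # Map each entity name to the indices of the segments it appears in.
        index: dict[str, list[int]] = {}
        for i, seg in enumerate(self.segments):
            for name in seg.entities:
                index.setdefault(name, []).append(i)
        return index


def retrieve(kb: KnowledgeBase, query_entities: set[str]) -> list[int]:
    """Retrieve: global pass that collects segments mentioning any query entity."""
    index = kb.entity_index()
    hits: set[int] = set()
    for name in query_entities:
        hits.update(index.get(name, []))
    return sorted(hits)


def perceive(kb: KnowledgeBase, candidates: list[int], keyword: str) -> list[int]:
    """Perceive: local pass; a caption keyword match stands in for a VLM call."""
    return [i for i in candidates if keyword in kb.segments[i].caption]


def review(evidence: list[int]) -> bool:
    """Review: accept the answer only if some supporting evidence was found."""
    return len(evidence) > 0


def answer(kb: KnowledgeBase, query_entities: set[str], keyword: str,
           max_rounds: int = 2) -> list[tuple[float, float]]:
    """Run the three phases, widening the search once if review fails."""
    for _ in range(max_rounds):
        candidates = retrieve(kb, query_entities)
        evidence = perceive(kb, candidates, keyword)
        if review(evidence):
            return [(kb.segments[i].start, kb.segments[i].end) for i in evidence]
        # Assumed fallback: revisit the video using every known entity.
        query_entities = set().union(*(s.entities for s in kb.segments))
    return []


demo_kb = KnowledgeBase([
    Segment(0.0, 10.0, "a chef chops onions", {"chef", "onion"}),
    Segment(10.0, 20.0, "a dog runs outside", {"dog"}),
    Segment(20.0, 30.0, "the chef plates the dish", {"chef", "dish"}),
])

print(answer(demo_kb, {"chef"}, "plates"))
```

In this toy setup, a query about the chef plating a dish is resolved in the first round, while a query whose initial entities miss the relevant clip succeeds only after the review step rejects the empty evidence and triggers a second, wider retrieval.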
Similar Papers
Empowering Agentic Video Analytics Systems with Video Language Models
CV and Pattern Recognition
Lets computers understand very long videos.
CAViAR: Critic-Augmented Video Agentic Reasoning
CV and Pattern Recognition
Lets computers understand long, tricky videos better.
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
CV and Pattern Recognition
Helps computers understand long videos better.