VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

Published: December 13, 2025 | arXiv ID: 2512.12360v1

By: Yufei Yin, Qianke Meng, Minghao Chen and more

Potential Business Impact:

Helps computers understand long videos faster.

Business Areas:

Image Recognition, Data and Analytics, Software

Long-form video understanding remains challenging because of the extended temporal structure and dense multimodal cues involved. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or on token-heavy video preprocessing to guide multimodal large language models (MLLMs) in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM reasons and builds its memory adaptively, on the fly. Specifically, it runs a continuous loop of observing, thinking, acting, and memorizing, in which a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the agent's operation, providing precise contextual information to support the controller's decisions. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
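
To make the observe, think, act, memorize loop more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the controller is a hand-written rule standing in for an MLLM, the "video" is a list of `(scene, event)` text labels instead of pixels, and every name (`HierarchicalMemory`, `scan_coarse`, `inspect_segment`, `run_agent`, the two memory levels) is a hypothetical chosen for illustration. It only shows how a coarse pass stored in memory can decide where a finer pass is worth spending frame reads.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    """Toy two-level memory; the level names are illustrative assumptions."""
    levels: dict = field(default_factory=lambda: {"coarse": [], "fine": []})

    def write(self, level, index, note):
        self.levels[level].append((index, note))

    def read(self, level):
        return list(self.levels[level])


def scan_coarse(video, stride):
    """Hypothetical tool: read only every `stride`-th frame."""
    return [(i, video[i]) for i in range(0, len(video), stride)]


def inspect_segment(video, start, end):
    """Hypothetical tool: read every frame in [start, end)."""
    return [(i, video[i]) for i in range(start, min(end, len(video)))]


def run_agent(video, scene, event, stride=4, max_steps=20):
    """Return (frame index of `event` or None, number of frame reads)."""
    memory = HierarchicalMemory()
    reads = 0
    todo = None  # coarse hits awaiting inspection; None until the coarse scan ran
    for _ in range(max_steps):
        # Observe and think: consult memory, then choose the next action.
        if any(e == event for _, e in memory.read("fine")):
            break
        if todo is None:
            # Act (coarse): sparse scan; memorize the scene label per sample.
            seen = scan_coarse(video, stride)
            for i, (s, _) in seen:
                memory.write("coarse", i, s)
            todo = [i for i, s in memory.read("coarse") if s == scene]
        elif todo:
            # Act (fine): densely inspect one segment flagged by the coarse pass.
            start = todo.pop(0)
            seen = inspect_segment(video, start, start + stride)
            for i, (_, e) in seen:
                memory.write("fine", i, e)
        else:
            break  # nothing left to inspect
        reads += len(seen)
    hits = [i for i, e in memory.read("fine") if e == event]
    return (hits[0] if hits else None), reads


# A 40-frame toy video: the kettle boils at frame 21, inside the kitchen scene.
video = (
    [("street", "")] * 16
    + [("kitchen", "")] * 5
    + [("kitchen", "kettle boils")]
    + [("kitchen", "")] * 2
    + [("garden", "")] * 16
)
print(run_agent(video, "kitchen", "kettle boils"))
```

On this toy input the agent locates frame 21 after 18 frame reads, compared with 40 for an exhaustive pass. A scene shorter than `stride` frames can fall between coarse samples and be missed, which is one reason a real controller would need to adapt its sampling rather than use a fixed rule.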

Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding

CV and Pattern Recognition

Helps computers understand videos like people do.

18 Nov 2025 0

91%

WorldMM: Dynamic Multimodal Memory Agent for Long Video Reasoning

CV and Pattern Recognition

Lets computers understand very long videos better.

2 Dec 2025 1

90%

VideoMem: Enhancing Ultra-Long Video Understanding via Adaptive Memory Management

CV and Pattern Recognition

Lets computers watch and remember long videos.

4 Dec 2025 0

Page Count

15 pages