VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Published: March 17, 2025 | arXiv ID: 2503.13444v2

By: Ye Liu, Kevin Qinghong Lin, Chang Wen Chen and more

Potential Business Impact:

Helps computers answer questions about long videos by first locating the moments that contain the evidence.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning, especially for videos, remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) we identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering; (ii) to integrate these diverse roles efficiently, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks, including 3 on grounded video question-answering (Grounded VideoQA), 6 on video temporal grounding (VTG), and 5 on general video question-answering (VideoQA), verify that our agent achieves state-of-the-art performance on diverse video understanding tasks, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.
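The role-based workflow in the abstract can be pictured with a small toy sketch. It is not the authors' implementation: the class `SharedBackbone`, the functions `plan`, `ground`, `verify`, `answer`, and `run_agent`, and the hard-coded intervals are all hypothetical stand-ins, and no real model or LoRA library is involved. It only illustrates the idea of one shared model whose active adapter is swapped per role, rather than several separate models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


@dataclass
class SharedBackbone:
    """Hypothetical stand-in for one base model with per-role adapters."""
    adapters: Dict[str, Callable[..., object]]
    active: str = ""
    switches: List[str] = field(default_factory=list)

    def run(self, role: str, *args):
        # Switching roles only changes which adapter is active;
        # the (imaginary) base weights are never reloaded.
        if role != self.active:
            self.active = role
            self.switches.append(role)
        return self.adapters[role](*args)


def plan(question: str) -> List[str]:
    # Toy planner: always ground, then verify, then answer.
    return ["grounder", "verifier", "answerer"]


def ground(question: str) -> List[Interval]:
    # Toy grounder: returns fixed candidate intervals (made-up values).
    return [(0.0, 10.0), (42.0, 55.0)]


def verify(question: str, candidates: List[Interval]) -> Interval:
    # Toy verifier: picks the longest candidate as a placeholder score.
    return max(candidates, key=lambda iv: iv[1] - iv[0])


def answer(question: str, interval: Optional[Interval]) -> str:
    # Toy answerer: reports which span the answer would be grounded in.
    if interval is None:
        return "answer from the whole video"
    return f"answer grounded in {interval[0]:.0f}s-{interval[1]:.0f}s"


def run_agent(model: SharedBackbone, question: str) -> str:
    candidates: List[Interval] = []
    best: Optional[Interval] = None
    result = ""
    for step in model.run("planner", question):
        if step == "grounder":
            candidates = model.run("grounder", question)
        elif step == "verifier":
            best = model.run("verifier", question, candidates)
        elif step == "answerer":
            result = model.run("answerer", question, best)
    return result


model = SharedBackbone(adapters={
    "planner": plan, "grounder": ground,
    "verifier": verify, "answerer": answer,
})
print(run_agent(model, "When does the dog catch the ball?"))
```

In this sketch `model.switches` records each role change, which is the only cost paid when moving between roles; a real system would presumably activate a different LoRA adapter on the same backbone at that point.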

Page Count

16 pages
