Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Published: August 6, 2025 | arXiv ID: 2508.04416v1

By: Haoji Zhang, Xin Gu, Jiawen Li and more

Potential Business Impact:

Helps computers understand long videos better.

Plain English Summary

Imagine a smart assistant that can understand what's happening in a video, even if it's really long. This new system is like giving that assistant a special set of tools to look at videos frame by frame, helping it understand complex actions and answer questions more accurately. This means you could get better answers from AI about videos, whether it's for entertainment, education, or even helping with tasks like identifying specific moments in security footage.

The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets: MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, which outperforms existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. All code, data, and model weights will be made publicly available.
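
For readers unfamiliar with the Group Relative Policy Optimization family that DGRPO builds on, the following is a minimal sketch of the standard group-relative advantage step only: each sampled response's reward is normalized against the mean and standard deviation of its own group of rollouts. The difficulty-aware adjustment is not described in this abstract and is not shown here. The function name, the `eps` parameter, and the example rewards are illustrative assumptions, not the authors' code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    Hypothetical helper for illustration; `eps` is an assumed constant
    that avoids division by zero when all rewards in a group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed example: four rollouts for one prompt, scored 1.0 or 0.0.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_rewards)

# A group in which every rollout gets the same reward carries no signal.
uniform_advantages = group_relative_advantages([1.0, 1.0, 1.0])
```

When every rollout in a group receives the same reward, as for a prompt that is always solved or never solved, all advantages are zero and the group contributes no gradient. That is one reason prompt difficulty matters for group-based methods.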

92%

Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models

CV and Pattern Recognition

Helps computers understand videos by watching carefully.

28 Nov 2025 1

92%

VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning

CV and Pattern Recognition

Helps computers understand videos by watching them.

16 Oct 2025 2

Page Count

22 pages
