
Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Published: November 26, 2025 | arXiv ID: 2511.21375v1

By: Xin Gu, Haoji Zhang, Qihang Fan and more

Potential Business Impact:

Helps computers find objects in videos using words.

Business Areas:

Motion Capture, Media and Entertainment, Video

Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.
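The abstract names temporal and spatial rewards and reports results in m_tIoU, but does not give their formulas. As a minimal sketch, assuming these quantities build on the overlap measures commonly used in STVG evaluation, the Python below computes a temporal IoU between two time spans and a video-level spatial IoU that averages per-frame box IoU over the union of predicted and ground-truth frames. The function names and the frame-indexed dictionary layout are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical layout
Span = Tuple[float, float]               # (start, end) in seconds or frames


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_iou(pred: Dict[int, Box], gt: Dict[int, Box]) -> float:
    """Sum box IoU over frames present in both tracks, divided by the
    number of frames present in either track."""
    union_frames = set(pred) | set(gt)
    if not union_frames:
        return 0.0
    shared_frames = set(pred) & set(gt)
    total = sum(box_iou(pred[f], gt[f]) for f in shared_frames)
    return total / len(union_frames)
```

Averaging `temporal_iou` over a test set would give a mean temporal IoU in the spirit of m_tIoU. How the paper weights or combines the five reward terms is not stated in the abstract, so no combination is sketched here.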

OmniGround: A Comprehensive Spatio-Temporal Grounding Benchmark for Real-World Complex Scenarios

CV and Pattern Recognition

Helps computers find things in videos using words.

21 Nov 2025

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

CV and Pattern Recognition

Finds objects in videos using text descriptions.

18 Sep 2025

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

CV and Pattern Recognition

Finds exact moments in videos from descriptions.

19 Oct 2025


Page Count

15 pages
