
1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning

Published: December 7, 2025 | arXiv ID: 2512.06673v1

By: Shida Gao, Feng Xue, Xiangfeng Wang and more

Potential Business Impact:

Finds objects in videos using smart language.

Business Areas:

Image Recognition Data and Analytics, Software

Spatio-temporal grounding and reasoning aims to locate the temporal segment and spatial region of an event in a video given a user query, while also reasoning about semantics such as causality, temporal order, and action relationships. To achieve this, current MLLMs primarily treat bounding boxes as text tokens and generate them autoregressively. However, such autoregressive spatial decoding leads to very long output sequences, causing spatial errors to accumulate over time and the localization results to drift progressively across a video. To address this, we present a Detector-Empowered Video LLM, DEViL for short, which couples a Video LLM with an open-vocabulary detector (OVD). Specifically, the MLLM and the detector are connected via a reference-semantic token (RST) that distills the user query into a rich semantic representation. Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding, enabling end-to-end learning of both referential understanding and spatial localization. Furthermore, we propose a tube-mined temporal regularization (TTReg) within the OVD, which drives the OVD to generate temporally consistent queries for target objects, thereby ensuring effective temporal association. Experiments demonstrate that DEViL achieves strong performance across various fine-grained video understanding tasks, particularly STVG and GroundedVQA. Code will be released at https://github.com/gaostar123/DeViL.
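
To make the sequence-length argument in the abstract concrete, the following is a rough back-of-the-envelope sketch, not the authors' code. It compares how many output tokens an LLM would emit if it wrote one bounding box per frame as text against emitting a single reference token that a detector then consumes. The function names and the per-box token costs (`tokens_per_coord`, `separators_per_box`) are illustrative assumptions; real tokenizers and box formats will differ.

```python
# Illustrative sketch only: the token costs below are assumptions,
# not figures from the paper or from any specific tokenizer.

def box_text_tokens(num_frames, tokens_per_coord=1, coords_per_box=4,
                    separators_per_box=1):
    """Approximate output tokens when every frame's box is decoded as text.

    Each box contributes its coordinates (x1, y1, x2, y2) plus separator
    tokens, so the total grows linearly with the number of frames.
    """
    per_box = coords_per_box * tokens_per_coord + separators_per_box
    return num_frames * per_box


def reference_token_count():
    """Output tokens when the LLM emits one reference token (hypothetical
    stand-in for an RST-style design) and a detector produces the boxes."""
    return 1


if __name__ == "__main__":
    for frames in (8, 64, 256):
        print(
            f"{frames:>4} frames: "
            f"{box_text_tokens(frames):>5} box-text tokens vs "
            f"{reference_token_count()} reference token"
        )
```

Under these assumed costs the text-decoding route scales linearly with video length, which is the setting in which the abstract says errors accumulate, whereas the reference-token route keeps the language model's spatial output constant and leaves per-frame localization to the detector.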

Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

CV and Pattern Recognition

Helps computers find objects in videos using words.

26 Nov 2025 1

90%

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

CV and Pattern Recognition

Finds exact moments in videos from descriptions.

19 Oct 2025 2

90%

SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability

CV and Pattern Recognition

Helps computers find objects in videos.

18 Mar 2025 2


Repos / Data Links

github.com

Page Count

17 pages
