
V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

Published: March 14, 2025 | arXiv ID: 2503.11495v1

By: Zixu Cheng, Jian Hu, Ziquan Liu and more

Potential Business Impact:

Teaches computers to understand video actions like people.

Business Areas:

Image Recognition, Data and Analytics, Software

Humans process video through a sequential spatio-temporal reasoning logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). But can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence and neglect relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases when generating answers. In this work, we introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset that elicits the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments on 14 Video-LLMs with V-STaR reveal significant gaps between current Video-LLMs and the requirements of robust and consistent spatio-temporal reasoning.
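To make the "what / when / where" decomposition concrete, the following is a minimal sketch of how one answer chain might be scored per step. It is not the benchmark's actual scoring code: the `ChainAnswer` record, the function names, and the choice of exact match for "what" and intersection-over-union (IoU) for "when" and "where" are all assumptions made for illustration, since the abstract does not name its metrics.

```python
from dataclasses import dataclass


@dataclass
class ChainAnswer:
    """Hypothetical record for one what/when/where answer chain."""
    what: str                                  # answer text
    when: tuple[float, float]                  # (start_s, end_s) of the relevant segment
    where: tuple[float, float, float, float]   # (x1, y1, x2, y2) box in one frame


def temporal_iou(pred, gold):
    """IoU of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(pred, gold):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    def area(b):
        return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

    inter_w = max(0.0, min(pred[2], gold[2]) - max(pred[0], gold[0]))
    inter_h = max(0.0, min(pred[3], gold[3]) - max(pred[1], gold[1]))
    inter = inter_w * inter_h
    union = area(pred) + area(gold) - inter
    return inter / union if union > 0 else 0.0


def score_chain(pred, gold):
    """Score each step separately (assumed metrics, see lead-in)."""
    return {
        "what": float(pred.what.strip().lower() == gold.what.strip().lower()),
        "when": temporal_iou(pred.when, gold.when),
        "where": box_iou(pred.where, gold.where),
    }


# Made-up example values, not taken from the dataset.
gold = ChainAnswer("a red cup", (4.0, 8.0), (1.0, 1.0, 3.0, 3.0))
pred = ChainAnswer("A red cup", (2.0, 6.0), (0.0, 0.0, 2.0, 2.0))
scores = score_chain(pred, gold)
```

Keeping the three scores separate, rather than collapsing them into one number, makes it visible when a model gets the "what" right while missing the "when" or "where", which is the kind of gap the abstract describes.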

ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos

CV and Pattern Recognition

Teaches computers to understand videos like people.

16 Mar 2025

91%

Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph

Artificial Intelligence

Helps computers understand how things move.

13 Oct 2025

91%

Tool-Augmented Spatiotemporal Reasoning for Streamlining Video Question Answering Task

CV and Pattern Recognition

Helps computers understand videos by seeing and thinking.

11 Dec 2025


Country of Origin

🇨🇳 🇸🇬 🇬🇧 United Kingdom, Singapore, China

Page Count

10 pages
