From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models

Published: August 14, 2025 | arXiv ID: 2508.10770v1

By: Tiancheng Han, Yunfei Gao, Yong Li and more

Potential Business Impact:

Teaches computers to understand how things move.

Spatio-physical reasoning, a foundational capability for understanding the real physical world, is a critical step towards building robust world models. While recent vision language models (VLMs) have shown remarkable progress in specialized domains like multimodal mathematics and pure spatial understanding, their capability for spatio-physical reasoning remains largely unexplored. This paper provides a comprehensive diagnostic analysis of mainstream VLMs, revealing that current models perform inadequately on this crucial task. Further detailed analysis shows that this underperformance is largely attributable to biases caused by human-like priors and a lack of deep reasoning. To address these challenges, we apply supervised fine-tuning followed by rule-based reinforcement learning to Qwen2.5-VL-7B, resulting in significant improvements in spatio-physical reasoning capabilities and surpassing leading proprietary models. Nevertheless, despite this success, the model's generalization to new physics scenarios remains limited, underscoring the pressing need for new approaches in spatio-physical reasoning.

Vision-Language Memory for Spatial Reasoning

CV and Pattern Recognition

Robots understand 3D space better from videos.

25 Nov 2025

91%

Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models

Machine Learning (CS)

Tests if computers understand how things move.

10 Sep 2025

91%

SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data

CV and Pattern Recognition

Teaches computers to understand where things are.

29 Apr 2025

Page Count

9 pages
