QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

Published: December 22, 2025 | arXiv ID: 2512.19526v1

By: Li Puyin, Tiange Xiang, Ella Mao and more

Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors such as background noise, counterfactual priors, and strategic prompting, and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning about kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
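To make the idea of "using one of these properties as an input prior" concrete, here is a minimal sketch of how a known object size can fix the pixel-to-meter scale and turn a pixel displacement into a velocity estimate. This is an illustration only, not the benchmark's pipeline: the function names, the example numbers, and the relative-error score are all assumptions, since the abstract does not specify QuantiPhy's scoring formula.

```python
# Illustrative sketch only. All names and numbers here are hypothetical,
# and relative error is an assumed metric, not QuantiPhy's documented one.

def meters_per_pixel(known_size_m: float, size_px: float) -> float:
    """Derive the image scale from a size prior (object size in meters)."""
    return known_size_m / size_px


def velocity_m_per_s(displacement_px: float, dt_s: float, scale: float) -> float:
    """Convert a pixel displacement over dt_s seconds into meters per second."""
    return displacement_px * scale / dt_s


def relative_error(predicted: float, truth: float) -> float:
    """Absolute error as a fraction of the ground-truth magnitude."""
    return abs(predicted - truth) / abs(truth)


# Prior: the object is 0.5 m wide and spans 100 px in the frame.
scale = meters_per_pixel(0.5, 100.0)

# Observation: the object moves 240 px in 0.5 s.
v_est = velocity_m_per_s(240.0, 0.5, scale)

# Compare against a made-up ground truth of 2.5 m/s.
err = relative_error(v_est, 2.5)
print(f"estimated velocity: {v_est:.2f} m/s, relative error: {err:.2%}")
```

The same scale factor would apply to acceleration (pixels per second squared times meters per pixel), which is why a single prior on any one of size, velocity, or acceleration is in principle enough to recover the other two from pixel measurements.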

DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Artificial Intelligence

Teaches computers to understand how things move.

7 Aug 2025 2

92%

PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

CV and Pattern Recognition

Teaches robots to understand how things move.

27 Jan 2025 1

91%

Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models

Machine Learning (CS)

Tests if computers understand how things move.

10 Sep 2025 1
