PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
By: Wanjia Zhao, Qinwei Ma, Jingzhe Shi, and more
Potential Business Impact:
Measures how reliably AI models reason through physics problems step by step, not just whether they reach the right final answer.
Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, an approach that fails to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combined with a fully rule-based method we developed for symbolic formula equivalence matching, this ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more closely aligned with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for subsequent training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
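To make the two core ideas in the abstract concrete, here is a minimal sketch of (1) a reference solution stored as a DAG of formulas whose edges encode causal dependencies, and (2) symbolic equivalence matching of a candidate's steps against that reference. This is not the authors' implementation: the paper's matcher is fully rule-based, whereas this sketch substitutes a sympy-based check, and the DAG structure, node names, scoring rule, and free-fall example are all hypothetical illustrations.

```python
# Illustrative sketch of DAG-based process scoring (assumptions noted above).
import sympy as sp

# Hypothetical reference DAG for a free-fall problem: each node holds a
# formula, and "deps" lists the steps it causally depends on.
g, t, v, h = sp.symbols("g t v h", positive=True)
reference_dag = {
    "v_eq": {"formula": sp.Eq(v, g * t), "deps": []},
    "h_eq": {"formula": sp.Eq(h, sp.Rational(1, 2) * g * t**2), "deps": ["v_eq"]},
}

def formulas_equivalent(a: sp.Eq, b: sp.Eq) -> bool:
    """Treat two equations as equivalent if their residuals agree up to a
    nonzero constant factor, so rearrangements like 2h = g*t**2 still match.
    (Stand-in for the paper's rule-based matcher.)"""
    ratio = sp.simplify((a.lhs - a.rhs) / (b.lhs - b.rhs))
    return ratio != 0 and ratio.is_constant()

def score_solution(candidate: dict) -> float:
    """Credit a step only if its formula matches the reference AND all of its
    causal prerequisites were already credited (DAG-consistent scoring).
    Assumes reference_dag keys are listed in topological order."""
    credited = set()
    for name, node in reference_dag.items():
        deps_ok = all(d in credited for d in node["deps"])
        step = candidate.get(name)
        if deps_ok and step is not None and formulas_equivalent(step, node["formula"]):
            credited.add(name)
    return len(credited) / len(reference_dag)

# An algebraically different but equivalent formulation still earns full credit.
candidate = {
    "v_eq": sp.Eq(v, g * t),
    "h_eq": sp.Eq(2 * h, g * t**2),  # rearranged form of h = (1/2) g t^2
}
print(score_solution(candidate))  # -> 1.0
```

The point of the sketch is the dependency gating: a correct formula copied out of context earns nothing unless its causal prerequisites were also established, which is what distinguishes this style of scoring from final-answer matching or linear step checklists.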
Similar Papers
PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems
Computation and Language
Tests inference-time techniques that help LLMs solve hard physics problems more accurately.
Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models
Machine Learning (CS)
Tests whether vision-language models understand how physical objects move.
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Computation and Language
Tests how well LLMs perceive and reason about hard physics problems.