SymPyBench: A Dynamic Benchmark for Scientific Reasoning with Executable Python Code
By: Shima Imani, Seungwhan Moon, Adel Ahmadyan, and more
We introduce SymPyBench, a large-scale synthetic benchmark of 15,045 university-level physics problems (90/10 train/test split). Each problem is fully parameterized, supporting an effectively infinite range of input configurations, and is accompanied by structured, step-by-step reasoning and executable Python code that produces the ground-truth solution for any parameter set. The benchmark contains three question types: MC-Symbolic (multiple choice with symbolic options), MC-Numerical (multiple choice with numerical options), and free-form (open-ended responses). These diverse formats test complementary reasoning skills. Leveraging the dynamic, code-driven nature of the benchmark, we introduce three novel evaluation metrics in addition to standard accuracy: the Consistency Score, Failure Rate, and Confusion Rate, which quantify variability and uncertainty across problem variants. Experiments with state-of-the-art instruction-tuned language models reveal both strengths and limitations in scientific reasoning, positioning SymPyBench as a foundation for developing more robust and interpretable reasoning systems.
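To make the parameterized, code-driven design concrete, here is a minimal hypothetical sketch of how one problem template could pair sampled inputs with executable Python code that returns the ground-truth answer for any parameter set. The specific problem, function names, and parameter ranges below are illustrative assumptions and are not drawn from SymPyBench itself; only the general idea of sampling parameters and grading against an executable solver comes from the abstract above.

```python
# Hypothetical illustration (not from the paper): a parameterized physics
# problem whose ground-truth answer is produced by executable code.
import random
import sympy as sp


def projectile_range(v0: float, theta_deg: float, g: float = 9.81) -> float:
    """Ground-truth solver: horizontal range of a projectile on level ground."""
    theta = sp.pi * theta_deg / 180        # degrees -> radians, kept symbolic
    expr = v0**2 * sp.sin(2 * theta) / g   # R = v0^2 * sin(2*theta) / g
    return float(expr.evalf())


def sample_variant(seed: int) -> dict:
    """Draw one parameter configuration and render question text plus answer."""
    rng = random.Random(seed)
    v0 = round(rng.uniform(5.0, 50.0), 1)   # launch speed in m/s (assumed range)
    theta = rng.choice([15, 30, 45, 60])    # launch angle in degrees (assumed)
    question = (
        f"A projectile is launched at {v0} m/s at {theta} degrees above the "
        f"horizontal. Ignoring air resistance, how far does it travel?"
    )
    return {"question": question, "answer_m": projectile_range(v0, theta)}


if __name__ == "__main__":
    # Each seed yields a distinct, automatically graded variant of the template.
    for seed in range(3):
        v = sample_variant(seed)
        print(v["question"])
        print(f"Ground truth: {v['answer_m']:.2f} m\n")
```

Under a setup like this, re-sampling the parameters produces new variants of the same underlying problem, which is what makes per-template measures of variability and uncertainty, such as the Consistency Score described above, meaningful to compute.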
Similar Papers
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
Computation and Language
Tests how well AI understands hard science problems.
MorphoBench: A Benchmark with Difficulty Adaptive to Model Reasoning
Artificial Intelligence
Tests how smart AI can solve hard problems.
MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science
Artificial Intelligence
Tests if AI can solve hard science problems.