EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

Published: September 22, 2025 | arXiv ID: 2509.17677v1

By: Xiyuan Zhou, Xinlei Wang, Yirui He and more

Potential Business Impact:

Tests if computers can solve tricky real-world problems.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than symbolic mathematical computation: they involve uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on engineering problem solving. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate a model's robustness, domain-specific knowledge, and mathematical reasoning abilities. Experimental results show a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on the high-level engineering tasks. These findings indicate that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench.
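
The level-by-variant design described in the abstract can be illustrated with a minimal sketch of how per-level and per-variant accuracy might be tabulated. This is not the EngiBench code or data format: the record layout, the level and variant labels (including an "original" variant for the unmodified problem), and the function names are all assumptions made for illustration, and the toy records are placeholders rather than reported results.

```python
from collections import defaultdict

# Hypothetical labels; the actual EngiBench identifiers may differ.
LEVELS = ("level_1", "level_2", "level_3")
VARIANTS = ("original", "perturbed", "knowledge_enhanced", "math_abstraction")


def accuracy_by_level_and_variant(records):
    """Return {(level, variant): accuracy} for a list of graded answers.

    Each record is assumed to be a dict with keys "level", "variant",
    and "correct" (a bool). This layout is an illustrative assumption.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        key = (record["level"], record["variant"])
        totals[key] += 1
        correct[key] += int(record["correct"])
    return {key: correct[key] / totals[key] for key in totals}


def robustness_gap(accuracy, level):
    """Accuracy drop from the original to the perturbed variant at one level."""
    return accuracy[(level, "original")] - accuracy[(level, "perturbed")]


# Placeholder records for demonstration only (not benchmark results).
toy_records = [
    {"level": "level_1", "variant": "original", "correct": True},
    {"level": "level_1", "variant": "original", "correct": True},
    {"level": "level_1", "variant": "perturbed", "correct": True},
    {"level": "level_1", "variant": "perturbed", "correct": False},
]

toy_accuracy = accuracy_by_level_and_variant(toy_records)
toy_gap = robustness_gap(toy_accuracy, "level_1")
```

Comparing the original variant against each controlled variant at the same level is one plausible way to separate robustness, domain knowledge, and mathematical reasoning, as the abstract describes; the paper itself should be consulted for the metrics actually used.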

MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

Artificial Intelligence

Tests if AI can solve hard science problems.

14 Oct 2025 2

90%

EngChain: A Symbolic Benchmark for Verifiable Multi-Step Reasoning in Engineering

Computation and Language

Tests if AI can solve hard engineering problems.

3 Nov 2025 0

90%

Toward Engineering AGI: Benchmarking the Engineering Design Capabilities of LLMs

Computational Engineering, Finance, and Science

AI can now design complex machines and systems.

1 Jul 2025 1

Repos / Data Links

github.com

Page Count

29 pages
