Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics
By: Zena Al-Khalili, Nick Howell, Dietrich Klakow
Potential Business Impact:
Helps computers solve math problems more logically.
Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous assessment of the programs they generate. In this work, we bridge this gap by conducting an in-depth analysis of the programs generated by code-assisted LLMs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that a model's capabilities significantly impact the logic it implements to solve a problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases the proportion of sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs' limits in the math domain.
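To illustrate the kind of distinction the abstract describes (this sketch is not taken from the paper), consider three hypothetical model-generated programs for the problem "What is the sum of the first 100 positive integers?". All three execute to the correct answer, so an execution-accuracy metric scores them identically, yet only the first is grounded in a mathematical concept; the other two reflect the unsound patterns mentioned above, memorized constants and exhaustive enumeration.

# Illustrative sketch, assuming a toy problem; not code from the paper.

def grounded(n: int = 100) -> int:
    # Grounded in the arithmetic-series formula n(n+1)/2.
    return n * (n + 1) // 2

def memorized() -> int:
    # Hardcodes a (possibly memorized) answer; breaks for any other n.
    return 5050

def exhaustive(n: int = 100) -> int:
    # Brute-force enumeration with no appeal to the underlying structure.
    return sum(range(1, n + 1))

# Execution correctness cannot tell these apart.
assert grounded() == memorized() == exhaustive() == 5050

A purely execution-based evaluation accepts all three, which is exactly the gap a soundness-oriented taxonomy of generated programs is meant to expose.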
Similar Papers
Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach
Machine Learning (CS)
Makes AI better at solving math problems.
Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning
Software Engineering
Helps computers understand and write computer code better.
How Does LLM Reasoning Work for Code? A Survey and a Call to Action
Software Engineering
Helps computers fix and write computer code.