Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics
By: Zena Al-Khalili, Nick Howell, Dietrich Klakow
Potential Business Impact:
Helps computers solve math problems more logically.
Assisting LLMs with code generation has improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous assessment of the programs they generate. In this work, we bridge this gap by conducting an in-depth analysis of the programs generated by code-assisted LLMs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that a model's capabilities significantly impact the logic it implements to solve a problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases the proportion of sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs' limits in the math domain.
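To illustrate the kind of distinction the abstract describes (this sketch is not taken from the paper), consider three hypothetical model-generated programs for the problem "What is the sum of the first 100 positive integers?". All three execute to the correct answer, so an execution-accuracy metric scores them identically, yet only the first is grounded in a mathematical concept; the other two reflect the unsound patterns mentioned above, memorized constants and exhaustive enumeration.

# Illustrative sketch, assuming a toy problem; not code from the paper.

def grounded(n: int = 100) -> int:
    # Grounded in the arithmetic-series formula n(n+1)/2.
    return n * (n + 1) // 2

def memorized() -> int:
    # Hardcodes a (possibly memorized) answer; breaks for any other n.
    return 5050

def exhaustive(n: int = 100) -> int:
    # Brute-force enumeration with no appeal to the underlying structure.
    return sum(range(1, n + 1))

# Execution correctness cannot tell these apart.
assert grounded() == memorized() == exhaustive() == 5050

A purely execution-based evaluation accepts all three, which is exactly the gap a soundness-oriented taxonomy of generated programs is meant to expose.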
Similar Papers
Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach
Machine Learning (CS)
Makes AI better at solving math problems.
Evaluating the Generalization Capabilities of Large Language Models on Code Reasoning
Software Engineering
Helps computers understand and write computer code better.
How Does LLM Reasoning Work for Code? A Survey and a Call to Action
Software Engineering
Helps computers fix and write computer code.