Evaluation of LLMs for mathematical problem solving
By: Ruonan Wang, Runxi Wang, Yunwen Shen, and more
Potential Business Impact:
Computers solve harder math problems better.
Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but their ability to solve mathematical problems remains understudied. In this study, we compare three prominent LLMs, GPT-4o, DeepSeek-V3, and Gemini-2.0, on three mathematics datasets of varying complexity (GSM8K, MATH500, and MIT OpenCourseWare). We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework, assessing final answer correctness, step completeness, step validity, intermediate calculation accuracy, and problem comprehension. The results show that GPT-4o is the most stable and consistent performer across all datasets, and it does particularly well on the high-level questions in the MIT OpenCourseWare dataset. DeepSeek-V3 is highly competitive in well-structured domains such as optimisation but shows fluctuating accuracy on statistical inference tasks. Gemini-2.0 demonstrates strong linguistic understanding and clarity on well-structured problems but performs poorly on multi-step reasoning and symbolic logic. Our error analysis reveals distinct deficits in each model: GPT-4o at times lacks sufficient explanation or precision; DeepSeek-V3 omits intermediate steps; and Gemini-2.0 is less flexible in higher-dimensional mathematical reasoning.
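To make the five SCoT dimensions concrete, below is a minimal sketch of how per-problem scores could be recorded and aggregated. The field names, the 0-1 scales, and the unweighted mean are assumptions for illustration only, not the paper's actual implementation or weighting.

```python
from dataclasses import dataclass

@dataclass
class SCoTScore:
    """Hypothetical per-problem record for the five SCoT evaluation
    dimensions named in the abstract (names and scales are illustrative)."""
    answer_correct: bool   # final answer matches the reference solution
    steps_complete: float  # fraction of required solution steps present, 0-1
    steps_valid: float     # fraction of stated steps that are logically sound, 0-1
    calc_accuracy: float   # fraction of intermediate calculations that are correct, 0-1
    comprehension: float   # rater score for understanding of the problem, 0-1

    def aggregate(self) -> float:
        """Unweighted mean over the five dimensions; one simple choice,
        since the paper may weight or report each dimension separately."""
        return (float(self.answer_correct) + self.steps_complete
                + self.steps_valid + self.calc_accuracy
                + self.comprehension) / 5.0

# Example: scoring a single model response to a GSM8K problem
score = SCoTScore(answer_correct=True, steps_complete=0.8,
                  steps_valid=1.0, calc_accuracy=0.75, comprehension=1.0)
print(f"aggregate SCoT score: {score.aggregate():.2f}")
```

Recording the dimensions separately, rather than only an aggregate, is what allows the per-model error analysis described above (e.g. missing intermediate steps versus incorrect calculations) to be broken out.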
Similar Papers
Evaluating the Reasoning Abilities of LLMs on Underrepresented Mathematics Competition Problems
Artificial Intelligence
Tests AI math skills on tough competition problems.
Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
Artificial Intelligence
Helps AI tutors give better, personalized learning help.
An evaluation of LLMs for generating movie reviews: GPT-4o, Gemini-2.0 and DeepSeek-V3
Computation and Language
Computers write movie reviews that fool people.