Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach
By: Afrar Jahin, Arif Hassan Zidan, Wei Zhang, and more
Potential Business Impact:
Shows which AI models solve math problems best, guiding model choice.
With the rapid advancement of Artificial Intelligence (AI), Large Language Models (LLMs) have significantly impacted a wide array of domains, including healthcare, engineering, science, and education. Among their capabilities, mathematical reasoning remains particularly challenging, often requiring multi-step logic and abstract generalization. While prior work has explored LLM performance on reasoning tasks, comprehensive evaluations that span both depth and breadth across model families remain limited. In this study, we present a systematic evaluation of mathematical reasoning abilities across eight leading LLMs, including two recent DeepSeek models, using three independent benchmark datasets. Our analyses reveal several key findings: (1) DeepSeek-R1 performs competitively with o1 across most domains and achieves the highest accuracy on the MMLU Formal Logic benchmark; (2) distilled variants, such as DeepSeek-1.5B, exhibit substantial performance degradation; and (3) Gemini 2.0 Flash achieves the lowest response latency. Beyond quantitative metrics, we explore how architectural choices, training paradigms, and optimization strategies contribute to variation in reasoning performance. These findings provide new insights into the capabilities and limitations of current LLMs in mathematical domains, and offer guidance for the development of future models better aligned with rigorous reasoning demands.
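To make the evaluation setup concrete, the following is a minimal sketch of the kind of harness the abstract describes: each model answers each benchmark item while per-model accuracy and mean response latency are recorded. The query_model function is a hypothetical stand-in for a real inference call (an API client or local model), and the benchmark items below are placeholders, not actual MMLU Formal Logic questions; the paper's exact prompts and scoring are not reproduced here.

import time
from statistics import mean

def query_model(model_name: str, question: str) -> str:
    """Hypothetical inference call; replace with a real API or local model client."""
    return "A"  # stub answer so the sketch runs end to end

# (question, gold answer) placeholders standing in for benchmark items
BENCHMARK = [
    ("If all X are Y and all Y are Z, are all X Z? (A) yes (B) no", "A"),
    ("Is 'P or not P' a tautology in classical logic? (A) yes (B) no", "A"),
]

# Model names follow those mentioned in the abstract
MODELS = ["deepseek-r1", "o1", "gemini-2.0-flash"]

for model in MODELS:
    correct, latencies = 0, []
    for question, gold in BENCHMARK:
        start = time.perf_counter()
        answer = query_model(model, question)
        latencies.append(time.perf_counter() - start)  # response latency per item
        correct += (answer.strip() == gold)            # exact-match accuracy
    print(f"{model}: accuracy={correct / len(BENCHMARK):.2f}, "
          f"mean_latency={mean(latencies):.3f}s")

Comparing accuracy and latency side by side in this way is what allows claims such as "Gemini 2.0 Flash achieves the lowest response latency" to be made on a common footing across models.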
Similar Papers
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases
Computation and Language
Tests AI doctors' thinking for better patient care.
A Survey on Large Language Models for Mathematical Reasoning
Artificial Intelligence
Helps computers solve math problems like a person.
Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning
Artificial Intelligence
Tests if AI can think like a person.