Score: 1

An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

Published: August 12, 2025 | arXiv ID: 2508.08833v1

By: Yuren Hao, Xiang Wan, Chengxiang Zhai

BigTech Affiliations: Stanford University

Potential Business Impact:

Tests whether AI can still do math when the problem's wording changes.

In this paper, we introduce a systematic framework that goes beyond conventional evaluation methods to assess LLMs' mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but vary linguistically and parametrically. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this evaluation methodology, we created PutnamGAP, a new benchmark dataset containing multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models, we observe sharp performance degradation on the variants. OpenAI's flagship reasoning model, O3, scores 49% on the originals but drops by 4 percentage points on surface variants and by 10.5 percentage points on core-step-based variants, while smaller models fare far worse. Overall, the results show that the proposed evaluation methodology is effective for deepening our understanding of the robustness of LLMs and for generating new insights toward further improving their mathematical reasoning capabilities.
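The degradation figures quoted above are accuracy gaps, in percentage points, between a model's score on the original problems and its score on each mathematically-equivalent variant set. Below is a minimal sketch of that comparison, assuming hypothetical per-item grading records; the field names, variant labels, and example records are illustrative and not taken from the PutnamGAP release.

```python
from collections import defaultdict

# Hypothetical per-item results: each record pairs a problem with the variant
# type ("original", "surface", "core_step") and whether the model's answer was
# judged correct. These records are placeholders, not actual benchmark data.
results = [
    {"problem_id": "p001", "variant": "original", "correct": True},
    {"problem_id": "p001", "variant": "surface", "correct": True},
    {"problem_id": "p001", "variant": "core_step", "correct": False},
    {"problem_id": "p002", "variant": "original", "correct": False},
    {"problem_id": "p002", "variant": "surface", "correct": False},
    {"problem_id": "p002", "variant": "core_step", "correct": False},
]

def accuracy_by_variant(records):
    """Return the fraction of correct answers for each variant type."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["variant"]] += 1
        hits[r["variant"]] += int(r["correct"])
    return {v: hits[v] / totals[v] for v in totals}

acc = accuracy_by_variant(results)
baseline = acc.get("original", 0.0)
for variant, score in sorted(acc.items()):
    if variant == "original":
        continue
    drop_pp = (baseline - score) * 100  # degradation in percentage points
    print(f"{variant}: {score:.1%} accuracy, {drop_pp:.1f} pp below original")
```

Under this reading, a report such as "49% on originals, 10.5 points lower on core-step-based variants" corresponds to a per-variant accuracy of roughly 38.5% computed in the manner sketched above.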

Country of Origin
🇺🇸 United States

Page Count
16 pages

Category
Computer Science:
Computation and Language