Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning
By: Jiaxing Guo, Wenjie Yang, Shengzhong Zhang, and more
Potential Business Impact:
Finds mistakes in the reasoning steps AI uses to solve math problems.
Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently arrive at correct answers through fundamentally unsound reasoning, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs' answer correctness and their much lower process correctness. Existing automated methods such as LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a methodology for meticulous, step-by-step verification of mathematical solutions that identifies individual incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to baselines, especially for complex, multi-step problems. This offers a more robust path toward evaluating and training LLMs with genuine mathematical reasoning.
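The abstract does not spell out ParaStepVerifier's mechanics, but the core idea it describes, verifying each reasoning step rather than only the final answer, can be sketched in a few lines. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the `split_into_steps` helper and the `judge_step` callable (standing in for whatever verifier model the paper actually uses) are hypothetical names introduced here for clarity.

```python
from typing import Callable, List, Optional


def split_into_steps(solution: str) -> List[str]:
    """Hypothetical splitter: treat each non-empty line of the solution as one reasoning step."""
    return [line.strip() for line in solution.splitlines() if line.strip()]


def verify_solution(
    problem: str,
    solution: str,
    judge_step: Callable[[str, List[str], str], bool],
) -> Optional[int]:
    """Check a solution step by step instead of only checking its final answer.

    `judge_step(problem, previous_steps, step)` should return True if `step`
    follows soundly from the problem statement and the previous steps.
    Returns the index of the first flawed step, or None if every step passes.
    """
    steps = split_into_steps(solution)
    for i, step in enumerate(steps):
        if not judge_step(problem, steps[:i], step):
            return i
    return None


if __name__ == "__main__":
    # Stub judge for illustration only: flags a single obviously wrong claim.
    def toy_judge(problem: str, prev: List[str], step: str) -> bool:
        return "2 + 2 = 5" not in step

    flawed = verify_solution(
        problem="Compute 2 + 2.",
        solution="Step 1: 2 + 2 = 5.\nStep 2: Therefore the answer is 5.",
        judge_step=toy_judge,
    )
    print("First flawed step:", flawed)  # -> 0
```

The contrast with outcome supervision is the point of the sketch: an answer-only check inspects just the final line, whereas a step-level verifier can localize which step broke even when the final answer happens to be right.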
Similar Papers
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Artificial Intelligence
Teaches computers to prove math problems step-by-step.
Long-horizon Reasoning Agent for Olympiad-Level Mathematical Problem Solving
Computation and Language
Checks AI's thinking steps for mistakes.
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Artificial Intelligence
Computers can't truly do hard math problems.