Scaling Generative Verifiers for Natural Language Mathematical Proof Verification and Selection
By: Sadegh Mahdavi, Branislav Kisacanin, Shubham Toshniwal, and more
Potential Business Impact:
Helps computers check math proofs for mistakes.
Large language models have achieved remarkable success on final-answer mathematical problems, largely due to the ease of applying reinforcement learning with verifiable rewards. However, the reasoning underlying these solutions is often flawed. Advancing to rigorous proof-based mathematics requires reliable proof verification capabilities. We begin by analyzing multiple evaluation setups and show that focusing on a single benchmark can lead to brittle or misleading conclusions. To address this, we evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance. We then scale two major generative verification methods (GenSelect and LLM-as-a-Judge) to millions of tokens and identify their combination as the most effective framework for solution verification and selection. We further show that the choice of prompt for LLM-as-a-Judge significantly affects the model's performance, but reinforcement learning can reduce this sensitivity. Yet despite improving proof-level metrics, reinforcement learning does not enhance final-answer precision, indicating that current verifiers often reward stylistic or procedural correctness rather than mathematical validity. Our results establish practical guidelines for designing and evaluating scalable proof-verification and selection systems.
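The abstract names two generative verification methods, LLM-as-a-Judge and GenSelect, and identifies their combination as the most effective framework. The Python sketch below illustrates one plausible way to chain them: a per-solution judge pass that filters candidates, followed by a single comparative GenSelect pass over the survivors. The prompts, the call_llm helper, and the VERDICT/BEST parsing conventions are illustrative assumptions, not the paper's actual implementation.

# Hedged sketch: judge-then-select pipeline over candidate proofs.
# `call_llm` is a placeholder for any chat-completion API; the prompt
# wording and verdict format are assumptions made for this example.

from typing import Callable, List

JUDGE_PROMPT = (
    "You are a strict grader. Verify the following proof step by step.\n"
    "Problem:\n{problem}\n\nProposed proof:\n{solution}\n\n"
    "End your response with 'VERDICT: CORRECT' or 'VERDICT: INCORRECT'."
)

SELECT_PROMPT = (
    "You are given {n} candidate solutions to the same problem.\n"
    "Problem:\n{problem}\n\n{candidates}\n\n"
    "Compare them and end with 'BEST: <index>' for the most rigorous one."
)

def judge_then_select(
    problem: str,
    solutions: List[str],
    call_llm: Callable[[str], str],
) -> str:
    """Filter candidates with an LLM-as-a-Judge pass, then pick one via GenSelect."""
    # Stage 1: LLM-as-a-Judge -- keep only solutions the judge accepts.
    survivors = [
        s for s in solutions
        if "VERDICT: CORRECT" in call_llm(
            JUDGE_PROMPT.format(problem=problem, solution=s)
        )
    ]
    if not survivors:        # judge rejected everything; fall back to all candidates
        survivors = solutions
    if len(survivors) == 1:
        return survivors[0]

    # Stage 2: GenSelect -- one comparative prompt over the surviving candidates.
    candidates = "\n\n".join(
        f"[Solution {i}]\n{s}" for i, s in enumerate(survivors)
    )
    reply = call_llm(
        SELECT_PROMPT.format(n=len(survivors), problem=problem, candidates=candidates)
    )
    try:
        idx = int(reply.rsplit("BEST:", 1)[1].split()[0])
    except (IndexError, ValueError):
        idx = 0              # unparseable reply; default to the first survivor
    return survivors[idx % len(survivors)]

Running the judge first keeps the comparative GenSelect prompt short, which matters when scaling to many candidates; whether to filter-then-select or run both passes independently is a design choice this sketch does not settle.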
Similar Papers
DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning
Artificial Intelligence
Teaches computers to prove math problems step-by-step.
Incentivizing LLMs to Self-Verify Their Answers
Machine Learning (CS)
Helps computers check their own math answers.
From Solving to Verifying: A Unified Objective for Robust Reasoning in LLMs
Machine Learning (CS)
Helps AI check its own thinking better.