Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets
By: Adam Younsi, Abdalgader Abubaker, Mohamed El Amine Seddik, and more
Potential Business Impact:
Helps AI language models solve math problems more accurately while producing more varied solution approaches.
Achieving both accuracy and diverse reasoning remains challenging for Large Language Models (LLMs) in complex domains like mathematics. A key bottleneck is evaluating intermediate reasoning steps to guide generation without costly human annotations. To address this, we first introduce a novel Process Reward Model (PRM) trained automatically using Monte Carlo Tree Search coupled with a similarity-based data augmentation technique, effectively capturing step-level reasoning quality. Leveraging this PRM, we then adapt Generative Flow Networks (GFlowNets) to operate at the reasoning step level. Unlike traditional reinforcement learning focused on maximizing a single reward, GFlowNets naturally sample diverse, high-quality solutions proportional to their rewards, as measured by our PRM. Empirical evaluation shows strong improvements in both accuracy and solution diversity on challenging mathematical benchmarks (e.g., +2.59% absolute accuracy on MATH Level 5 for Llama3.2-3B), with effective generalization to unseen datasets (+9.4% absolute on SAT MATH). Our work demonstrates the potential of PRM-guided, step-level GFlowNets for developing more robust and versatile mathematical reasoning in LLMs.
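To illustrate the step-level, reward-proportional sampling behaviour the abstract contrasts with reward-maximizing RL, here is a minimal Python sketch. It is an assumption-laden illustration, not the paper's method: the `prm_score` stub stands in for the trained Process Reward Model, and the paper trains a GFlowNet policy rather than sampling directly from PRM scores as done here.

```python
import math
import random
from typing import List

def prm_score(partial_solution: List[str], candidate_step: str) -> float:
    """Hypothetical stand-in for a trained PRM: scores a candidate next
    reasoning step given the partial solution so far, in [0, 1]. The real
    PRM is trained from MCTS-derived step labels; this stub returns fixed
    toy scores purely for illustration."""
    toy_scores = {
        "factor the quadratic": 0.70,
        "complete the square": 0.65,
        "guess and check": 0.10,
    }
    return toy_scores.get(candidate_step, 0.05)

def sample_next_step(partial_solution: List[str],
                     candidates: List[str],
                     temperature: float = 1.0) -> str:
    """Sample the next reasoning step with probability proportional to its
    PRM reward (softmax over log-rewards) instead of taking the argmax.
    This mimics the GFlowNet-style behaviour described in the abstract:
    high-reward steps are favoured, but other valid steps keep non-zero
    probability, preserving solution diversity."""
    rewards = [max(prm_score(partial_solution, c), 1e-8) for c in candidates]
    logits = [math.log(r) / temperature for r in rewards]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(candidates, weights=probs, k=1)[0]

if __name__ == "__main__":
    partial = ["Let x be the unknown; the equation is x^2 - 5x + 6 = 0."]
    cands = ["factor the quadratic", "complete the square", "guess and check"]
    counts = {c: 0 for c in cands}
    for _ in range(1000):
        counts[sample_next_step(partial, cands)] += 1
    # Both high-reward steps appear frequently; greedy decoding would
    # collapse onto a single one.
    print(counts)
```

Running the sketch shows the two high-reward steps sampled at comparable rates, which is the diversity property reward-proportional sampling provides over pure reward maximization.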
Similar Papers
Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners
Computation and Language
Teaches computers to solve problems step-by-step.
GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
Computation and Language
Fixes math problems by explaining each step.