From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling

Published: May 24, 2025 | arXiv ID: 2506.00027v1

By: Zhengyu Chen, Yudong Wang, Teng Xiao and more

BigTech Affiliations: Meituan

Potential Business Impact:

Helps computers solve problems better with feedback.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Recent advancements in improving the reasoning capabilities of Large Language Models have underscored the efficacy of Process Reward Models (PRMs) in addressing intermediate errors through structured feedback mechanisms. This study analyzes PRMs from multiple perspectives, including training methodologies, scalability, and generalization capabilities. We investigate the interplay between pre-training and reward model training FLOPs to assess their influence on PRM efficiency and accuracy in complex reasoning tasks. Our analysis reveals a pattern of diminishing returns in performance with increasing PRM scale, highlighting the importance of balancing model size and computational cost. Furthermore, the diversity of training datasets significantly impacts PRM performance, emphasizing the importance of diverse data to enhance both accuracy and efficiency. We further examine test-time scaling strategies, identifying Monte Carlo Tree Search as the most effective method when computational resources are abundant, while Best-of-N Sampling serves as a practical alternative under resource-limited conditions. Notably, our findings indicate that PRMs trained on mathematical datasets exhibit performance comparable to those tailored for code generation, suggesting robust cross-domain generalization. Employing a gradient-based metric, we observe that PRMs exhibit a preference for selecting responses with similar underlying patterns, further informing their optimization.
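As a rough illustration of the Best-of-N strategy mentioned in the abstract, the sketch below picks, from N candidate solutions, the one whose step-level scores aggregate highest. Everything named here is an assumption made for illustration: `toy_prm` is a hypothetical stand-in for a real PRM, and the aggregation rules (`min`, `prod`, `last`) are common choices rather than anything the abstract specifies.

```python
from typing import Callable, List, Sequence


def aggregate_step_scores(step_scores: Sequence[float], how: str = "min") -> float:
    """Collapse per-step scores into one solution-level score (assumed rules)."""
    if not step_scores:
        raise ValueError("a candidate must contain at least one scored step")
    if how == "min":
        # A solution is only as good as its weakest step.
        return min(step_scores)
    if how == "last":
        # Trust the score assigned to the final step.
        return step_scores[-1]
    if how == "prod":
        # Treat scores as independent step-correctness probabilities.
        result = 1.0
        for score in step_scores:
            result *= score
        return result
    raise ValueError(f"unknown aggregation: {how!r}")


def best_of_n(
    candidates: List[List[str]],
    score_steps: Callable[[List[str]], List[float]],
    how: str = "min",
) -> List[str]:
    """Return the candidate (a list of reasoning steps) with the highest score."""
    return max(
        candidates,
        key=lambda steps: aggregate_step_scores(score_steps(steps), how),
    )


def toy_prm(steps: List[str]) -> List[float]:
    """Hypothetical scorer: penalizes any step containing the marker 'ERROR'."""
    return [0.1 if "ERROR" in step else 0.9 for step in steps]


if __name__ == "__main__":
    sampled = [
        ["set up equation", "ERROR: sign flipped", "final answer"],
        ["set up equation", "final answer"],
    ]
    print(best_of_n(sampled, toy_prm))
```

In practice `score_steps` would wrap a trained PRM, and the candidates would come from sampling the policy model N times. The cost is N generations plus N scoring passes, which is consistent with the abstract describing Best-of-N as the practical option when compute is limited.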

GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Computation and Language

Makes AI better at solving math problems.

1 Apr 2025 2

93%

A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Computation and Language

Teaches computers to think step-by-step.

9 Oct 2025 0

91%

Rewarding Graph Reasoning Process makes LLMs more Generalized Reasoners

Computation and Language

Teaches computers to solve problems step-by-step.

2 Mar 2025 1

Country of Origin

🇨🇳 China

Page Count

14 pages
