A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
By: Congming Zheng, Jiachen Zhu, Zhuoying Ou, and more
Potential Business Impact:
Teaches computers to think step-by-step.
Although Large Language Models (LLMs) exhibit advanced reasoning abilities, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
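To make the ORM-versus-PRM distinction concrete, here is a minimal sketch of the test-time-scaling use case the abstract mentions: rank N candidate reasoning chains either by an outcome reward on the final answer only, or by a process reward aggregated over intermediate steps. The scorer functions, the minimum-score aggregation, and the candidate format are illustrative assumptions, not the survey's specific method.

```python
from typing import Callable, List

# A candidate solution is a list of reasoning steps; the last step holds the final answer.
Candidate = List[str]

def orm_select(candidates: List[Candidate],
               outcome_reward: Callable[[str], float]) -> Candidate:
    """Outcome reward model (ORM): judge only the final answer of each candidate."""
    return max(candidates, key=lambda c: outcome_reward(c[-1]))

def prm_select(candidates: List[Candidate],
               step_reward: Callable[[List[str], str], float]) -> Candidate:
    """Process reward model (PRM): score every step given its prefix, then
    aggregate (here: the minimum step score, so one bad step sinks the chain)."""
    def chain_score(chain: Candidate) -> float:
        scores = [step_reward(chain[:i], step) for i, step in enumerate(chain)]
        return min(scores) if scores else float("-inf")
    return max(candidates, key=chain_score)

if __name__ == "__main__":
    # Toy stand-ins for learned reward models (hypothetical, for illustration only).
    outcome_reward = lambda answer: 1.0 if answer.strip().endswith("42") else 0.0
    step_reward = lambda prefix, step: 0.9 if "because" in step else 0.4

    candidates = [
        ["Guess the answer.", "The answer is 42"],
        ["It is 6 * 7 because both factors are given.", "So the answer is 42 because 6 * 7 = 42"],
    ]
    print(orm_select(candidates))  # may pick the lucky guess: both final answers look correct
    print(prm_select(candidates))  # prefers the chain whose intermediate steps score well
```

Aggregating by the minimum step score is only one choice; averaging step scores or using the last step's score are common alternatives, and which one works best is part of the design space the survey discusses.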
Similar Papers
The Bidirectional Process Reward Model
Computation and Language
Helps AI check its thinking both ways.
From Mathematical Reasoning to Code: Generalization of Process Reward Models in Test-Time Scaling
Computation and Language
Helps computers solve problems better with feedback.
From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
Computation and Language
Makes AI understand and follow instructions better.