Efficient Process Reward Model Training via Active Learning
By: Keyu Duan, Zichen Liu, Xin Mao, and more
Potential Business Impact:
Teaches computers to learn faster with less work.
Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data. We then compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM against vanilla fine-tuning in a pool-based active learning setting, demonstrating that ActPRM reduces annotation costs by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering more than 1M math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this selected dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with models of the same size.
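The abstract describes an uncertainty-driven training loop: score a batch with the PRM, keep only the uncertain trajectories, have a stronger reasoning model label them, and update on that subset. The sketch below illustrates that loop under assumptions not stated in the abstract: the uncertainty measure (entropy of per-step correctness probabilities), the threshold `UNCERTAINTY_THRESHOLD`, and the `annotator` callable are all illustrative, not the paper's exact method.

```python
import torch

UNCERTAINTY_THRESHOLD = 0.6  # hypothetical cutoff, not from the paper

def step_uncertainty(prm, batch):
    """Entropy of the PRM's per-step correctness predictions (assumed criterion)."""
    with torch.no_grad():
        probs = torch.sigmoid(prm(batch["input_ids"]))  # (batch, num_steps)
    entropy = -(probs * probs.clamp_min(1e-8).log()
                + (1 - probs) * (1 - probs).clamp_min(1e-8).log())
    return entropy.mean(dim=-1)  # one uncertainty score per trajectory

def active_training_step(prm, optimizer, batch, annotator):
    # 1) Forward pass to estimate uncertainty; keep only uncertain trajectories.
    scores = step_uncertainty(prm, batch)
    keep = scores > UNCERTAINTY_THRESHOLD
    if not keep.any():
        return None  # nothing in this batch is worth annotating
    selected = {k: v[keep] for k, v in batch.items()}

    # 2) Query the expensive reasoning model for step-level labels
    #    (annotator is a hypothetical labeling function).
    labels = annotator(selected)  # (num_selected, num_steps), float in {0, 1}

    # 3) Compute the loss against the new labels and update the PRM.
    logits = prm(selected["input_ids"])
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The same selection logic, applied offline with a trained PRM, corresponds to the data-filtering step the abstract mentions for the 1M+ trajectory corpus.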
Similar Papers
FreePRM: Training Process Reward Models Without Ground Truth Process Labels
Computation and Language
Teaches AI to learn without needing every step.
Adversarial Training for Process Reward Models
Machine Learning (CS)
Teaches AI to find and fix its own mistakes.
An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning
Computation and Language
Teaches computers to solve math problems better.