DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding
By: Ruiyi Zhang, Peijia Qin, Qi Cao, and more
Potential Business Impact:
Helps computers write better code by breaking programs down into functions.
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps, using a Chain-of-Function prompting strategy to induce modular code generation and enable PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9% pass@1 rate, surpassing OpenAI o4-mini.
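To make the function-as-step idea concrete, here is a minimal Python sketch of how a Chain-of-Function solution could be split into per-function "steps" and scored by a PRM for best-of-n selection at test time. The splitting uses Python's `ast` module; `prm_score_step` and `dummy_prm` are hypothetical placeholders standing in for a trained PRM, and the min-aggregation is one common choice rather than the paper's exact rule.

```python
import ast


def split_into_function_steps(solution_code: str) -> list[str]:
    """Split a Chain-of-Function style solution into per-function steps."""
    tree = ast.parse(solution_code)
    steps = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            steps.append(ast.get_source_segment(solution_code, node))
    return steps


def score_solution(solution_code: str, prm_score_step) -> float:
    """Aggregate per-function PRM scores into a solution-level score.

    `prm_score_step` is a hypothetical callable (prefix, step) -> float in [0, 1];
    here the solution score is the minimum step score (one common PRM choice).
    """
    steps = split_into_function_steps(solution_code)
    if not steps:
        return 0.0
    scores, prefix = [], ""
    for step in steps:
        scores.append(prm_score_step(prefix, step))
        prefix += step + "\n"
    return min(scores)


def best_of_n(candidates: list[str], prm_score_step) -> str:
    """Test-time scaling: keep the candidate with the highest PRM score."""
    return max(candidates, key=lambda c: score_solution(c, prm_score_step))


if __name__ == "__main__":
    # Toy stand-in for a trained PRM: prefers documented functions.
    def dummy_prm(prefix: str, step: str) -> float:
        return 0.9 if '"""' in step else 0.5

    candidates = [
        'def add(a, b):\n    return a + b\n',
        'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n',
    ]
    print(best_of_n(candidates, dummy_prm))
```

The label-correction component described in the abstract would sit upstream of this: intermediate step labels produced by Monte-Carlo rollouts are refined in a bi-level loop, with the inner level training the PRM on the corrected labels and the outer level adjusting the correction using clean final-solution unit-test outcomes.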
Similar Papers
AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
Computation and Language
Helps AI make better choices step-by-step.
FreePRM: Training Process Reward Models Without Ground Truth Process Labels
Computation and Language
Teaches AI to learn without needing every step.
VRPRM: Process Reward Modeling via Visual Reasoning
Machine Learning (CS)
Teaches computers to think better with less data.