PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards
By: Mukesh Ghimire, Aosong Feng, Liwen You, and more
Potential Business Impact:
Teaches computers to learn better without answers.
Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve at problem-solving, further gains will increasingly require high-quality solutions to difficult problems that humans cannot readily supply. As a result, learning from unlabeled data is becoming increasingly attractive to the research community. Existing methods extract a learning signal from a model's consistency, either by majority voting or by converting the model's internal confidence into a reward. Although internal consistency metrics such as entropy or self-certainty require no human intervention, as we show in this work, they are unreliable signals for large-scale and long-term training. To address this unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside the model's internal confidence in the absence of ground-truth labels. We show that effectively combining the PRM with self-certainty can lead to both stable training and better test-time performance, while also keeping the model's internal confidence in check.
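The abstract does not specify how the PRM signal and self-certainty are combined, so the sketch below is only an illustrative guess, not the paper's method. It assumes per-step PRM scores in [0, 1], per-token log-probabilities for the sampled solution, and a hypothetical mixing weight alpha; it shows one plausible way to blend the two signals into a single scalar reward that could feed a label-free policy update.

```python
import numpy as np

def self_certainty(token_logprobs):
    # Geometric-mean token probability (exp of the average log-prob):
    # a simple proxy for the model's internal confidence in its own output.
    return float(np.exp(np.mean(token_logprobs)))

def prm_score(step_scores):
    # Aggregate per-step Process Reward Model scores; using the minimum
    # reflects that a reasoning chain is only as sound as its weakest step.
    return float(np.min(step_scores))

def combined_reward(step_scores, token_logprobs, alpha=0.7):
    # Blend PRM guidance with self-certainty. The PRM term anchors the
    # signal so confidence alone cannot be gamed; alpha is a hypothetical
    # mixing weight, not a value reported by the paper.
    return alpha * prm_score(step_scores) + (1 - alpha) * self_certainty(token_logprobs)

# Example: one sampled solution with three reasoning steps.
steps = [0.92, 0.85, 0.78]               # hypothetical per-step PRM scores in [0, 1]
logprobs = [-0.21, -0.05, -0.33, -0.12]  # hypothetical per-token log-probabilities
print(combined_reward(steps, logprobs))
```

In a training loop, a reward of this form could replace the verifier or majority-vote signal in a standard policy-gradient update over sampled solutions; the exact aggregation and weighting used by PRISM are not given in this summary.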
Similar Papers
A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models
Computation and Language
Teaches computers to think step-by-step.
Demystifying Multilingual Chain-of-Thought in Process Reward Modeling
Computation and Language
Makes AI solve hard problems in many languages.
PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection
CV and Pattern Recognition
Makes AI learn faster and better from pictures.