PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning
By: Mengdi Li, Guanqiao Chen, Xufeng Zhao, and more
Potential Business Impact:
Teaches AI to understand your personal likes.
Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. We therefore introduce PersRM-R1, the first reasoning-based reward modeling framework designed to identify and represent personal factors from only one or a few personal exemplars. To address the challenges of limited data availability and the need for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
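To make the abstract's recipe concrete, here is a minimal sketch of a two-stage pipeline of that shape: supervised fine-tuning on reasoning-style preference judgments, then reinforcement fine-tuning against a verifiable preference reward. This is not the authors' implementation; the base model name, dataset fields, placeholder examples, and the use of Hugging Face TRL's SFTTrainer and GRPOTrainer (TRL >= 0.14) are all assumptions made for illustration.

```python
# Minimal sketch (not the PersRM-R1 code) of SFT followed by RL fine-tuning
# for a reasoning-based personalized reward model. All names are assumptions.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # hypothetical base model choice

# Stage 0: synthetic personalized preference data (single placeholder rows
# standing in for a generated corpus of user exemplars and candidate pairs).
sft_rows = [{
    "text": (
        "User exemplar: I prefer short, direct answers.\n"
        "Prompt: Explain photosynthesis.\n"
        "Response A: <long detailed essay>\n"
        "Response B: <two crisp sentences>\n"
        "Reasoning: The exemplar favors brevity, so B fits this user better.\n"
        "Judgment: B"
    )
}]
rl_rows = [{
    "prompt": (
        "User exemplar: I prefer short, direct answers.\n"
        "Prompt: Explain photosynthesis.\n"
        "Response A: <long detailed essay>\n"
        "Response B: <two crisp sentences>\n"
        "Think step by step, then end with 'Judgment: A' or 'Judgment: B'."
    ),
    "gold_choice": "B",
}]

# Stage 1: supervised fine-tuning on reasoning traces over preference pairs.
sft_trainer = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=Dataset.from_list(sft_rows),
    args=SFTConfig(output_dir="persrm-sft", max_steps=10),
)
sft_trainer.train()
sft_trainer.save_model("persrm-sft")

# Stage 2: reinforcement fine-tuning with a verifiable reward that checks
# whether the generated judgment matches the gold preference label.
def preference_reward(completions, gold_choice, **kwargs):
    return [
        1.0 if f"Judgment: {gold}" in completion else 0.0
        for completion, gold in zip(completions, gold_choice)
    ]

grpo_trainer = GRPOTrainer(
    model="persrm-sft",
    reward_funcs=preference_reward,
    train_dataset=Dataset.from_list(rl_rows),
    args=GRPOConfig(output_dir="persrm-r1", max_steps=10),
)
grpo_trainer.train()
```

The key design point the sketch illustrates is that the reward used in stage 2 is verifiable: the model is rewarded only when its final judgment agrees with the gold preference label, so the RL signal pushes the reasoning toward correct personalized comparisons rather than toward any particular surface style.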
Similar Papers
RM-R1: Reward Modeling as Reasoning
Computation and Language
Makes AI explain its answers better.
Physics-Informed Reward Machines
Machine Learning (CS)
Teaches robots to learn faster by giving them goals.
PrLM: Learning Explicit Reasoning for Personalized RAG via Contrastive Reward Optimization
Information Retrieval
Teaches computers to understand what you like.