PersRM-R1: Enhance Personalized Reward Modeling with Reinforcement Learning
By: Mengdi Li, Guanqiao Chen, Xufeng Zhao, and more
Potential Business Impact:
Teaches AI to understand your personal likes.
Reward models (RMs), which are central to existing post-training methods, aim to align LLM outputs with human values by providing feedback signals during fine-tuning. However, existing RMs struggle to capture nuanced, user-specific preferences, especially under limited data and across diverse domains. We therefore introduce PersRM-R1, the first reasoning-based reward modeling framework designed to identify and represent personal factors from only one or a few personal exemplars. To address the challenges of limited data availability and the need for robust generalization, our approach combines synthetic data generation with a two-stage training pipeline consisting of supervised fine-tuning followed by reinforcement fine-tuning. Experimental results demonstrate that PersRM-R1 outperforms existing models of similar size and matches the performance of much larger models in both accuracy and generalizability, paving the way for more effective personalized LLMs.
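To make the abstract's recipe concrete, here is a minimal sketch of a two-stage pipeline of that shape: supervised fine-tuning on reasoning-style preference judgments, then reinforcement fine-tuning against a verifiable preference reward. This is not the authors' implementation; the base model name, dataset fields, placeholder examples, and the use of Hugging Face TRL's SFTTrainer and GRPOTrainer (TRL >= 0.14) are all assumptions made for illustration.

```python
# Minimal sketch (not the PersRM-R1 code) of SFT followed by RL fine-tuning
# for a reasoning-based personalized reward model. All names are assumptions.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer, GRPOConfig, GRPOTrainer

BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # hypothetical base model choice

# Stage 0: synthetic personalized preference data (single placeholder rows
# standing in for a generated corpus of user exemplars and candidate pairs).
sft_rows = [{
    "text": (
        "User exemplar: I prefer short, direct answers.\n"
        "Prompt: Explain photosynthesis.\n"
        "Response A: <long detailed essay>\n"
        "Response B: <two crisp sentences>\n"
        "Reasoning: The exemplar favors brevity, so B fits this user better.\n"
        "Judgment: B"
    )
}]
rl_rows = [{
    "prompt": (
        "User exemplar: I prefer short, direct answers.\n"
        "Prompt: Explain photosynthesis.\n"
        "Response A: <long detailed essay>\n"
        "Response B: <two crisp sentences>\n"
        "Think step by step, then end with 'Judgment: A' or 'Judgment: B'."
    ),
    "gold_choice": "B",
}]

# Stage 1: supervised fine-tuning on reasoning traces over preference pairs.
sft_trainer = SFTTrainer(
    model=BASE_MODEL,
    train_dataset=Dataset.from_list(sft_rows),
    args=SFTConfig(output_dir="persrm-sft", max_steps=10),
)
sft_trainer.train()
sft_trainer.save_model("persrm-sft")

# Stage 2: reinforcement fine-tuning with a verifiable reward that checks
# whether the generated judgment matches the gold preference label.
def preference_reward(completions, gold_choice, **kwargs):
    return [
        1.0 if f"Judgment: {gold}" in completion else 0.0
        for completion, gold in zip(completions, gold_choice)
    ]

grpo_trainer = GRPOTrainer(
    model="persrm-sft",
    reward_funcs=preference_reward,
    train_dataset=Dataset.from_list(rl_rows),
    args=GRPOConfig(output_dir="persrm-r1", max_steps=10),
)
grpo_trainer.train()
```

The key design point the sketch illustrates is that the reward used in stage 2 is verifiable: the model is rewarded only when its final judgment agrees with the gold preference label, so the RL signal pushes the reasoning toward correct personalized comparisons rather than toward any particular surface style.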
Similar Papers
RM-R1: Reward Modeling as Reasoning
Computation and Language
Makes AI explain its answers better.
Physics-Informed Reward Machines
Machine Learning (CS)
Teaches robots to learn faster by giving them goals.
PrLM: Learning Explicit Reasoning for Personalized RAG via Contrastive Reward Optimization
Information Retrieval
Teaches computers to understand what you like.