Reward Learning through Ranking Mean Squared Error
By: Chaitanya Kharyal, Calarina Muslimani, Matthew E. Taylor
Reward design remains a significant bottleneck in applying reinforcement learning (RL) to real-world problems. A popular alternative is reward learning, where reward functions are inferred from human feedback rather than manually specified. Recent work has proposed learning reward functions from ratings rather than traditional binary preferences, enabling richer and potentially less cognitively demanding supervision. Building on this paradigm, we introduce a new rating-based RL method, Ranked Return Regression for RL (R4). At its core, R4 employs a novel ranking mean squared error (rMSE) loss, which treats teacher-provided ratings as ordinal targets. Our approach learns from a dataset of trajectory-rating pairs, where each trajectory is labeled with a discrete rating (e.g., "bad," "neutral," "good"). At each training step, we sample a set of trajectories, predict their returns, and rank them using a differentiable sorting operator (soft ranks). We then optimize a mean squared error loss between the resulting soft ranks and the teacher's ratings. Unlike prior rating-based approaches, R4 offers formal guarantees: its solution set is provably minimal and complete under mild assumptions. Empirically, using simulated human feedback, we demonstrate that R4 consistently matches or outperforms existing rating- and preference-based RL methods on robotic locomotion benchmarks from OpenAI Gym and the DeepMind Control Suite, while requiring significantly less feedback.
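To make the training step described above concrete, here is a minimal sketch of an rMSE-style loss in PyTorch. It is illustrative only, not the paper's implementation: the pairwise-sigmoid `soft_rank`, the temperature `tau`, the `reward_net` architecture, and the rescaling of discrete ratings onto the rank range `[1, n]` are all assumptions made for this sketch; the paper's differentiable sorting operator and rating-to-target mapping may differ.

```python
import torch

def soft_rank(values: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Differentiable (soft) ranks via pairwise sigmoids.

    soft_rank_i = 1 + sum_{j != i} sigmoid((v_i - v_j) / tau);
    as tau -> 0 this approaches the hard rank (1 = smallest value).
    """
    diff = values.unsqueeze(-1) - values.unsqueeze(-2)   # (n, n), diff[i, j] = v_i - v_j
    pairwise = torch.sigmoid(diff / tau)
    # Row sums include the diagonal term sigmoid(0) = 0.5; subtract it.
    return 1.0 + pairwise.sum(dim=-1) - 0.5

def rmse_loss(reward_net, trajectories, ratings, tau: float = 1.0,
              num_levels: int = 3) -> torch.Tensor:
    """Ranking-MSE loss on a sampled set of trajectory-rating pairs.

    trajectories: list of (T_i, obs_dim) tensors.
    ratings: (n,) integer ratings in {0, ..., num_levels - 1}.
    """
    # Predicted return of each trajectory = sum of per-step predicted rewards.
    returns = torch.stack([reward_net(traj).sum() for traj in trajectories])
    ranks = soft_rank(returns, tau)                      # values in [1, n]
    # Illustrative choice: rescale discrete ratings onto the rank range [1, n],
    # so tied ratings share a common ordinal target.
    n = returns.shape[0]
    targets = 1.0 + ratings.float() / (num_levels - 1) * (n - 1)
    return torch.nn.functional.mse_loss(ranks, targets)

# Usage with a toy reward model over 4-dimensional observations.
reward_net = torch.nn.Sequential(torch.nn.Linear(4, 64), torch.nn.ReLU(),
                                 torch.nn.Linear(64, 1))
trajs = [torch.randn(20, 4) for _ in range(8)]           # 8 sampled trajectories
ratings = torch.randint(0, 3, (8,))                      # e.g., bad/neutral/good
loss = rmse_loss(reward_net, trajs, ratings)
loss.backward()                                          # gradients flow through the soft ranks
```

Because the soft ranks are smooth in the predicted returns, the MSE gradient flows through the sorting step into the reward network, which is what distinguishes this loss from regressing directly onto raw rating values.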