Repairing Reward Functions with Human Feedback to Mitigate Reward Hacking
By: Stephane Hatgis-Kessell, Logan Mondal Bhamidipaty, Emma Brunskill
Potential Business Impact:
Fixes an AI's goals so they match what people actually want.
Human-designed reward functions for reinforcement learning (RL) agents are frequently misaligned with the human's true, unobservable objectives and thus act only as proxies. Optimizing a misspecified proxy reward function often induces reward hacking, yielding a policy misaligned with the human's true objectives. An alternative is to perform RL from human feedback, learning a reward function from scratch by collecting human preferences over pairs of trajectories; however, building such preference datasets is costly. To address the limitations of both approaches, we propose Preference-Based Reward Repair (PBRR): an automated iterative framework that repairs a human-specified proxy reward function by learning an additive, transition-dependent correction term from preferences. A manually specified reward function can yield policies that are highly suboptimal under the ground-truth objective, yet correcting only a few transitions may suffice to recover optimal performance. To identify and correct those transitions, PBRR uses a targeted exploration strategy and a new preference-learning objective. We prove that in tabular domains PBRR attains cumulative regret matching, up to constants, that of prior preference-based RL methods. In addition, on a suite of reward-hacking benchmarks, PBRR consistently outperforms baselines that learn a reward function from scratch from preferences or modify the proxy reward function via other approaches, while requiring substantially fewer preferences to learn high-performing policies.
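The abstract's core mechanism, an additive, transition-dependent correction learned from preferences, can be sketched in a few lines. The sketch below is illustrative only and is not the authors' implementation: the `CorrectionTerm` network, the toy trajectory format, and the plain Bradley-Terry loss (used here as a stand-in for PBRR's new preference-learning objective and targeted exploration strategy) are all assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of repairing a proxy reward
# with an additive, transition-dependent correction learned from preferences.
# Repaired reward: r(s, a, s') = r_proxy(s, a, s') + c(s, a, s').
import torch
import torch.nn as nn
import torch.nn.functional as F


class CorrectionTerm(nn.Module):
    """Learned additive correction c(s, a, s') on top of a fixed proxy reward."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, s_next):
        # One scalar correction per transition.
        return self.net(torch.cat([s, a, s_next], dim=-1)).squeeze(-1)


def repaired_return(correction, traj):
    """Trajectory return under the repaired reward: the proxy return plus
    the summed learned corrections over the trajectory's transitions."""
    proxy_r, s, a, s_next = traj  # proxy_r: (T,); s, a, s_next: (T, dim)
    return proxy_r.sum() + correction(s, a, s_next).sum()


def preference_loss(correction, preferred, rejected):
    """Bradley-Terry negative log-likelihood that the human-preferred
    trajectory has the higher repaired return (a stand-in for PBRR's
    actual preference-learning objective)."""
    margin = repaired_return(correction, preferred) - repaired_return(correction, rejected)
    return -F.logsigmoid(margin)


# Toy usage with random tensors standing in for collected trajectories.
if __name__ == "__main__":
    obs_dim, act_dim, T = 4, 2, 10
    correction = CorrectionTerm(obs_dim, act_dim)
    opt = torch.optim.Adam(correction.parameters(), lr=1e-3)

    def random_traj():
        return (torch.randn(T), torch.randn(T, obs_dim),
                torch.randn(T, act_dim), torch.randn(T, obs_dim))

    preferred, rejected = random_traj(), random_traj()
    for _ in range(200):
        opt.zero_grad()
        loss = preference_loss(correction, preferred, rejected)
        loss.backward()
        opt.step()
```

Because the proxy reward already carries most of the signal, the correction only needs to adjust the few transitions where the proxy is wrong, which is why far fewer preferences can suffice here than when learning a reward function from scratch.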
Similar Papers
Rectifying Shortcut Behaviors in Preference-based Reward Learning
Artificial Intelligence
Teaches AI to follow instructions, not cheat.
Mitigating Preference Hacking in Policy Optimization with Pessimism
Machine Learning (CS)
Teaches AI to follow rules without cheating.
Learning Real-World Acrobatic Flight from Human Preferences
Robotics
Teaches drones fancy flight tricks from human feedback.