Guardian: Decoupling Exploration from Safety in Reinforcement Learning
By: Kaitong Cai, Jusheng Zhang, Jing Yang, and more
Potential Business Impact:
Teaches robots to learn safely and quickly.
Hybrid offline-online reinforcement learning (O2O RL) promises both sample efficiency and robust exploration, but suffers from instability due to distribution shift between offline and online data. We introduce RLPD-GX, a framework that decouples policy optimization from safety enforcement: a reward-seeking learner explores freely, while a projection-based guardian guarantees rule-consistent execution and safe value backups. This design preserves the exploratory value of online interactions without collapsing to conservative policies. To further stabilize training, we propose dynamic curricula that gradually extend temporal horizons and anneal offline-online data mixing. We prove convergence via a contraction property of the guarded Bellman operator, and empirically show state-of-the-art performance on Atari-100k, achieving a normalized mean score of 3.02 (+45% over prior hybrid methods) with stronger safety and stability. Beyond Atari, ablations demonstrate consistent gains across safety-critical and long-horizon tasks, underscoring the generality of our design. Extensive results highlight decoupled safety enforcement as a simple yet principled route to robust O2O RL, suggesting a broader paradigm for reconciling exploration and safety in reinforcement learning.
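To make the decoupling concrete, here is a minimal sketch in a toy tabular setting: the reward-seeking learner ranks all actions freely, a projection-based guardian executes only a rule-consistent action, and the Bellman backup maxes only over rule-consistent actions. The per-state safety mask, the toy environment, and every name below are illustrative assumptions, not the paper's RLPD-GX implementation.

```python
# Hedged sketch of "decoupled exploration vs. safety": an unconstrained learner
# proposes actions, a guardian projects them onto a rule-consistent set, and the
# value backup is also guarded. All rules/constants are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 8, 4
gamma, alpha, epsilon = 0.99, 0.1, 0.1

# Assumed "rule": a boolean mask of allowed actions per state; action 0 is
# always allowed so the safe set is never empty.
safe_mask = rng.random((n_states, n_actions)) > 0.2
safe_mask[:, 0] = True

Q = np.zeros((n_states, n_actions))


def guardian_project(state: int, preferred: np.ndarray) -> int:
    """Guardian projection: among rule-consistent actions, keep the one the
    learner prefers most."""
    allowed = np.where(safe_mask[state])[0]
    return int(allowed[np.argmax(preferred[allowed])])


def guarded_backup_target(reward: float, next_state: int) -> float:
    """Guarded Bellman backup: the max in the target ranges only over
    rule-consistent actions, so unsafe actions never propagate value."""
    allowed = np.where(safe_mask[next_state])[0]
    return reward + gamma * Q[next_state, allowed].max()


def step_env(state: int, action: int):
    """Toy stand-in for the environment (pure illustration)."""
    return int(rng.integers(n_states)), float(rng.normal())


state = 0
for t in range(1000):
    # The learner explores freely (epsilon-greedy preference over ALL actions)...
    proposal = rng.random(n_actions) if rng.random() < epsilon else Q[state]
    # ...but only the guardian-projected action is ever executed.
    action = guardian_project(state, proposal)
    next_state, reward = step_env(state, action)
    # The value update also goes through the guardian.
    target = guarded_backup_target(reward, next_state)
    Q[state, action] += alpha * (target - Q[state, action])
    state = next_state
```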
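The dynamic curricula mentioned in the abstract can likewise be sketched as simple schedules: one anneals the offline share of each training batch toward mostly-online data, the other grows the temporal horizon from short to long. The linear schedules and constants below are assumptions for illustration, not the paper's settings.

```python
# Hedged sketch of the two curricula: annealed offline-online mixing and a
# gradually extended horizon. Schedules and constants are illustrative only.

def offline_fraction(step: int, total_steps: int,
                     start: float = 0.9, end: float = 0.1) -> float:
    """Fraction of each batch drawn from the offline buffer, annealed from
    `start` toward `end` as online experience accumulates (assumed linear)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac


def horizon(step: int, total_steps: int,
            h_min: int = 5, h_max: int = 100) -> int:
    """Temporal horizon (e.g. rollout / n-step length), grown from short
    early segments toward the full horizon (assumed linear)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(round(h_min + (h_max - h_min) * frac))


if __name__ == "__main__":
    total, batch = 100_000, 256
    for step in (0, 50_000, 100_000):
        n_offline = int(batch * offline_fraction(step, total))
        print(f"step={step}: offline={n_offline}, online={batch - n_offline}, "
              f"horizon={horizon(step, total)}")
```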
Similar Papers
Online Optimization for Offline Safe Reinforcement Learning
Machine Learning (CS)
Teaches robots to do tasks safely and well.
GoRL: An Algorithm-Agnostic Framework for Online Reinforcement Learning with Generative Policies
Machine Learning (CS)
Lets robots learn complex moves safely and quickly.
MOORL: A Framework for Integrating Offline-Online Reinforcement Learning
Machine Learning (CS)
Teaches robots to learn from past mistakes.