Mildly Conservative Regularized Evaluation for Offline Reinforcement Learning
By: Haohui Chen, Zhiyong Chen
Potential Business Impact:
Teaches computers to learn from old data safely.
Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without further environment interaction. A key challenge is the distribution shift between the learned policy and the behavior policy that collected the data, which leads to out-of-distribution (OOD) actions and overestimated value estimates. To prevent gross overestimation, the value function must remain conservative; however, excessive conservatism may hinder performance improvement. To address this, we propose the mildly conservative regularized evaluation (MCRE) framework, which balances conservatism and performance by combining the temporal difference (TD) error with a behavior cloning term in the Bellman backup. Building on this, we develop the mildly conservative regularized Q-learning (MCRQ) algorithm, which integrates MCRE into an off-policy actor-critic framework. Experiments show that MCRQ outperforms strong baselines and state-of-the-art offline RL algorithms on benchmark datasets.
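The abstract does not spell out the exact MCRE objective, but the idea of mixing a TD error with a behavior-cloning-style anchor inside the Bellman backup can be sketched. The following Python (PyTorch) snippet is an illustrative assumption, not the paper's implementation: the mixing weight `lam`, the network shapes, and the use of the dataset's logged next action are all hypothetical choices made for the sketch.

```python
# Illustrative sketch only: the exact MCRE objective is not given in the
# abstract, so `lam`, the architectures, and the behavior-anchored bootstrap
# below are assumptions for illustration, not the authors' method.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Actor(nn.Module):
    """Deterministic policy pi(s) -> a in [-1, 1]^action_dim."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)


class Critic(nn.Module):
    """Q(s, a) approximator."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def regularized_critic_loss(critic, target_critic, actor,
                            s, a, r, s_next, a_next_data, done,
                            gamma=0.99, lam=0.5):
    """Mildly conservative backup (assumed form): the bootstrap mixes the
    target Q at the learned policy's next action with the target Q at the
    dataset's (behavior) next action, so the backup stays partly anchored to
    in-distribution actions instead of fully trusting possibly OOD actions.
    Shapes: r and done are [batch, 1]; states/actions are [batch, dim]."""
    with torch.no_grad():
        a_next_pi = actor(s_next)                    # next action from the learned policy
        q_pi = target_critic(s_next, a_next_pi)      # standard off-policy bootstrap
        q_beta = target_critic(s_next, a_next_data)  # behavior-anchored bootstrap
        mixed_q = (1.0 - lam) * q_pi + lam * q_beta  # mild conservatism via interpolation
        td_target = r + gamma * (1.0 - done) * mixed_q
    q_pred = critic(s, a)
    return F.mse_loss(q_pred, td_target)             # TD error against the regularized target
```

In an off-policy actor-critic loop of the kind MCRQ is described as, a loss of this shape would replace the standard critic target while the actor is trained to maximize the learned Q as usual; the weight `lam` is the knob that trades conservatism against performance improvement.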
Similar Papers
BiCQL-ML: A Bi-Level Conservative Q-Learning Framework for Maximum Likelihood Inverse Reinforcement Learning
Machine Learning (CS)
Teaches robots to learn from watching.
Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Data
Machine Learning (CS)
Teaches factories to make things faster.
Imagination-Limited Q-Learning for Offline Reinforcement Learning
Machine Learning (CS)
Teaches robots to learn from past mistakes.