Q-Learning with Shift-Aware Upper Confidence Bound in Non-Stationary Reinforcement Learning
By: Ha Manh Bui, Felix Parker, Kimia Ghobadi, and more
Potential Business Impact:
Helps robots learn better when rules change.
We study Non-Stationary Reinforcement Learning (RL) under distribution shifts in both finite-horizon episodic and infinite-horizon discounted Markov Decision Processes (MDPs). In the finite-horizon case, the transition functions may suddenly change at a particular episode. In the infinite-horizon setting, such changes can occur at an arbitrary time step during the agent's interaction with the environment. While the Q-learning Upper Confidence Bound algorithm (QUCB) can discover a proper policy during learning, distribution shifts can cause this policy to exploit sub-optimal rewards after the shift occurs. To address this issue, we propose Density-QUCB (DQUCB), a shift-aware Q-learning UCB algorithm, which uses a transition density function to detect distribution shifts, then leverages its likelihood to enhance the uncertainty estimation quality of Q-learning UCB, resulting in a balance between exploration and exploitation. Theoretically, we prove that our oracle DQUCB achieves a better regret guarantee than QUCB. Empirically, our DQUCB enjoys the computational efficiency of model-free RL and outperforms QUCB baselines by achieving lower regret across RL tasks, as well as on a real-world COVID-19 patient hospital allocation task using a Deep Q-learning architecture.
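To make the idea concrete, here is a minimal, illustrative sketch of a shift-aware tabular Q-learning UCB agent. It is not the authors' DQUCB implementation: the empirical count-based transition density, the `shift_threshold` parameter, and the count-reset heuristic are assumptions made purely for illustration of how a likelihood-based shift signal could re-inflate the UCB exploration bonus.

```python
# Illustrative sketch only: a tabular Q-learning agent with a UCB bonus
# and a count-based transition density used to flag possible shifts.
import numpy as np

class ShiftAwareQUCB:
    def __init__(self, n_states, n_actions, gamma=0.99, c=1.0, shift_threshold=1e-3):
        self.nS, self.nA = n_states, n_actions
        self.gamma = gamma                  # discount factor
        self.c = c                          # UCB exploration coefficient
        self.shift_threshold = shift_threshold  # assumed detection threshold
        self.Q = np.zeros((n_states, n_actions))
        self.visits = np.ones((n_states, n_actions))                  # N(s, a)
        self.trans_counts = np.ones((n_states, n_actions, n_states))  # density model counts

    def transition_likelihood(self, s, a, s_next):
        # Empirical transition density P_hat(s' | s, a) from visit counts.
        probs = self.trans_counts[s, a] / self.trans_counts[s, a].sum()
        return probs[s_next]

    def select_action(self, s):
        # Optimistic action selection: Q value plus a UCB exploration bonus.
        bonus = self.c * np.sqrt(np.log(self.visits[s].sum() + 1.0) / self.visits[s])
        return int(np.argmax(self.Q[s] + bonus))

    def update(self, s, a, r, s_next):
        lik = self.transition_likelihood(s, a, s_next)
        if lik < self.shift_threshold:
            # Assumed heuristic: a very unlikely transition suggests a
            # distribution shift, so shrink counts to re-inflate the bonus.
            self.visits[s, a] = 1.0
            self.trans_counts[s, a] = np.ones(self.nS)

        self.trans_counts[s, a, s_next] += 1.0
        self.visits[s, a] += 1.0
        lr = 1.0 / self.visits[s, a]       # decaying learning rate
        target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += lr * (target - self.Q[s, a])
```

The sketch keeps the model-free Q-learning update and only adds a lightweight density check on each transition; how the actual DQUCB algorithm weights the likelihood inside its confidence bound is specified in the paper, not here.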
Similar Papers
Optimistic Reinforcement Learning with Quantile Objectives
Machine Learning (CS)
Teaches computers to make safer, smarter choices.
Empirical Comparison of Forgetting Mechanisms for UCB-based Algorithms on a Data-Driven Simulation Platform
Machine Learning (CS)
Helps computers learn faster when things change.
Smart Exploration in Reinforcement Learning using Bounded Uncertainty Models
Machine Learning (CS)
Teaches computers to learn faster from experience.