Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning
By: Yang Xu, Swetha Ganesh, Vaneet Aggarwal
Potential Business Impact:
Teaches computers to make good choices even with bad info.
We present a non-asymptotic convergence analysis of $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) under contamination, total-variation (TV) distance, and Wasserstein uncertainty sets. A key ingredient of our analysis is showing that the optimal robust $Q$ operator is a strict contraction with respect to a carefully designed semi-norm (with constant functions quotiented out). This property enables a stochastic approximation update that learns the optimal robust $Q$-function using $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also provide an efficient routine for robust $Q$-function estimation, which in turn facilitates robust critic estimation. Building on this, we introduce an actor-critic algorithm that learns an $\epsilon$-optimal robust policy within $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We provide numerical simulations to evaluate the performance of our algorithms.
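To make the stochastic-approximation idea in the abstract concrete, below is a minimal sketch of robust relative Q-learning on a toy tabular average-reward MDP under a delta-contamination uncertainty set. The random environment, the step-size schedule, and the reference function f(Q) = Q(s_ref, a_ref) used to quotient out constant shifts (the semi-norm viewpoint) are illustrative assumptions, not the paper's exact algorithm; the TV-distance and Wasserstein cases would replace the contamination backup with their own support-function routines.

```python
# A minimal sketch (assumed setup, not the paper's exact algorithm) of robust
# relative Q-learning for an average-reward MDP under a delta-contamination
# uncertainty set {(1-delta) * P(.|s,a) + delta * q : q arbitrary}.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 5, 3
delta = 0.1          # contamination level (assumed)
alpha0 = 0.5         # initial step size (assumed)
n_steps = 200_000

# Random nominal model: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

Q = np.zeros((n_states, n_actions))
s_ref, a_ref = 0, 0   # reference pair used to pin down the free constant

s = 0
for t in range(1, n_steps + 1):
    a = rng.integers(n_actions)                  # behavior policy: uniform exploration
    s_next = rng.choice(n_states, p=P[s, a])     # sample from the *nominal* kernel

    v = Q.max(axis=1)                            # greedy value at each state
    # Worst-case backup under delta-contamination: the support function is linear
    # in the nominal kernel, so one sampled next state gives an unbiased estimate.
    robust_backup = (1.0 - delta) * v[s_next] + delta * v.min()

    # Relative (average-reward) target: subtracting Q at the reference pair
    # quotients out constant shifts of Q.
    target = r[s, a] + robust_backup - Q[s_ref, a_ref]

    alpha = alpha0 / (1.0 + t / 1000.0)          # diminishing step size (assumed)
    Q[s, a] += alpha * (target - Q[s, a])
    s = s_next

# With this reference choice, Q[s_ref, a_ref] approximates the optimal robust gain.
print("estimated robust gain:", Q[s_ref, a_ref])
print("greedy robust policy:", Q.argmax(axis=1))
```

The contamination set is convenient for a model-free sketch because its worst case mixes the nominal expectation with the worst state's value, so the update needs only the observed next state; an actor-critic variant would use a similar robust critic estimate to drive policy-gradient updates.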
Similar Papers
Actor-Critics Can Achieve Optimal Sample Efficiency
Machine Learning (Stat)
Teaches computers to learn faster with less data.
Sample Complexity of Distributionally Robust Average-Reward Reinforcement Learning
Machine Learning (CS)
Helps robots learn tasks better and faster.
Provably Sample-Efficient Robust Reinforcement Learning with Average Reward
Machine Learning (CS)
Helps computers learn better with less data.