Revisiting Actor-Critic Methods in Discrete Action Off-Policy Reinforcement Learning
By: Reza Asad, Reza Babanezhad, Sharan Vaswani
Potential Business Impact:
Makes game AI learn better from past games.
Value-based approaches such as DQN are the default methods for off-policy reinforcement learning in discrete-action environments such as Atari. Common policy-based methods are either on-policy and do not learn effectively from off-policy data (e.g. PPO), or have poor empirical performance in the discrete-action setting (e.g. SAC). Consequently, starting from discrete SAC (DSAC), we revisit the design of actor-critic methods in this setting. First, we determine that the coupling between the actor and critic entropy is the primary reason behind the poor performance of DSAC. We demonstrate that merely decoupling these components allows DSAC to achieve performance comparable to DQN. Motivated by this insight, we introduce a flexible off-policy actor-critic framework that subsumes DSAC as a special case. Our framework allows using an m-step Bellman operator for the critic update, and enables combining standard policy optimization methods with entropy regularization to instantiate the resulting actor objective. Theoretically, we prove that the proposed methods are guaranteed to converge to the optimal regularized value function in the tabular setting. Empirically, we demonstrate that these methods can approach the performance of DQN on standard Atari games, and do so even without entropy regularization or explicit exploration.
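To make the two ideas in the abstract concrete, here is a minimal tabular sketch (not the authors' code) of (i) decoupling the entropy coefficients used by the actor and the critic, and (ii) an m-step soft Bellman target for the critic. All names (alpha_actor, alpha_critic, soft_bellman helpers) and the toy numbers are illustrative assumptions, not identifiers or results from the paper.

```python
# Illustrative sketch: decoupled actor/critic entropy coefficients and an
# m-step Bellman target, in the tabular setting. Names and values are assumptions.
import numpy as np

def softmax_policy(q_row, alpha):
    """Entropy-regularized (softmax) policy over one state's action-values."""
    if alpha == 0.0:
        # Greedy limit when the entropy bonus is switched off.
        p = np.zeros_like(q_row)
        p[np.argmax(q_row)] = 1.0
        return p
    z = (q_row - q_row.max()) / alpha
    e = np.exp(z)
    return e / e.sum()

def soft_value(q_row, alpha):
    """Soft state value: alpha * log-sum-exp(Q / alpha); reduces to max Q when alpha = 0."""
    if alpha == 0.0:
        return q_row.max()
    return alpha * np.log(np.sum(np.exp((q_row - q_row.max()) / alpha))) + q_row.max()

def m_step_soft_target(rewards, next_q_row, gamma, alpha_critic):
    """m-step Bellman target: discounted reward sum plus a bootstrapped soft value."""
    m = len(rewards)
    g = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return g + (gamma ** m) * soft_value(next_q_row, alpha_critic)

# Toy usage: one transition sequence of m = 3 steps before bootstrapping.
gamma = 0.99
alpha_critic = 0.0   # critic target without entropy, decoupled from the actor
alpha_actor = 0.05   # actor keeps a small entropy bonus in its own update
rewards = [1.0, 0.0, 0.5]
next_q_row = np.array([0.2, 1.3, -0.4])

target = m_step_soft_target(rewards, next_q_row, gamma, alpha_critic)
policy = softmax_policy(next_q_row, alpha_actor)
print("m-step critic target:", target)
print("actor policy at the bootstrap state:", policy)
```

In standard DSAC a single entropy coefficient enters both the critic's soft target and the actor's objective; the sketch separates them so that, for example, the critic can bootstrap without entropy (alpha_critic = 0) while the actor retains a regularized update.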
Similar Papers
Effective Reinforcement Learning Control using Conservative Soft Actor-Critic
Robotics
Teaches robots to learn and move better.
Actor-Critics Can Achieve Optimal Sample Efficiency
Machine Learning (Stat)
Teaches computers to learn faster with less data.
DR-SAC: Distributionally Robust Soft Actor-Critic for Reinforcement Learning under Uncertainty
Machine Learning (CS)
Makes robots learn better even when things change.