Wasserstein Policy Optimization
By: David Pfau, Ian Davies, Diana Borsa, and more
Potential Business Impact:
Teaches robots to move smoothly and learn faster.
We introduce Wasserstein Policy Optimization (WPO), an actor-critic algorithm for reinforcement learning in continuous action spaces. WPO can be derived as an approximation to Wasserstein gradient flow over the space of all policies projected into a finite-dimensional parameter space (e.g., the weights of a neural network), leading to a simple and completely general closed-form update. The resulting algorithm combines many properties of deterministic and classic policy gradient methods. Like deterministic policy gradients, it exploits knowledge of the gradient of the action-value function with respect to the action. Like classic policy gradients, it can be applied to stochastic policies with arbitrary distributions over actions -- without using the reparameterization trick. We show results on the DeepMind Control Suite and a magnetic confinement fusion task which compare favorably with state-of-the-art continuous control methods.
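As a rough illustration of the kind of update the abstract describes (using the gradient of the action-value function with respect to the action, for a stochastic policy, without the reparameterization trick), here is a minimal JAX sketch. It assumes an update direction of the form (d/dtheta d/da log pi_theta(a|s)) contracted with grad_a Q(s,a) for a diagonal-Gaussian policy; the exact WPO update, its derivation, and any scaling terms are given in the paper, and the names here (log_prob, q_fn, wpo_style_update) are illustrative stand-ins, not the authors' reference implementation.

```python
# Hypothetical sketch (assumed update form, not the paper's reference code):
# a parameter update built from grad_a Q(s, a) and the mixed derivative
# d/dtheta of grad_a log pi, consistent with the abstract's description of
# using action-value gradients with a stochastic policy and no
# reparameterization trick.
import jax
import jax.numpy as jnp

def log_prob(params, state, action):
    """Diagonal-Gaussian policy log-density (illustrative linear mean)."""
    mean = params["W"] @ state + params["b"]
    log_std = params["log_std"]
    z = (action - mean) / jnp.exp(log_std)
    return jnp.sum(-0.5 * z**2 - log_std - 0.5 * jnp.log(2.0 * jnp.pi))

def q_fn(state, action):
    """Stand-in critic; in practice this would be a learned Q-network."""
    return -jnp.sum((action - 0.1 * state[: action.shape[0]]) ** 2)

def wpo_style_update(params, state, action):
    """Per-sample direction: (d/dtheta grad_a log pi)^T grad_a Q."""
    grad_a_q = jax.grad(q_fn, argnums=1)(state, action)  # shape (A,)

    def contracted(p):
        # grad_a log pi contracted with grad_a Q; differentiating this scalar
        # w.r.t. the parameters yields the mixed-derivative update without
        # ever materializing the full Jacobian of grad_a log pi.
        grad_a_logp = jax.grad(log_prob, argnums=2)(p, state, action)
        return jnp.dot(grad_a_logp, jax.lax.stop_gradient(grad_a_q))

    return jax.grad(contracted)(params)

# Toy usage with random state/action samples.
state = jax.random.normal(jax.random.PRNGKey(0), (4,))
action = jax.random.normal(jax.random.PRNGKey(1), (2,))
params = {"W": jnp.zeros((2, 4)), "b": jnp.zeros(2), "log_std": jnp.zeros(2)}
update = wpo_style_update(params, state, action)
print(jax.tree_util.tree_map(jnp.shape, update))
```

In practice such an update would be averaged over a batch of states and sampled actions and applied with a standard optimizer; the contraction-then-differentiate trick above is just one convenient way to get the mixed derivative cheaply.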
Similar Papers
PPO in the Fisher-Rao geometry
Machine Learning (CS)
Makes computer learning more reliable and faster.
A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks
Machine Learning (CS)
Teaches computers to make better decisions faster.
Overcoming Non-stationary Dynamics with Evidential Proximal Policy Optimization
Machine Learning (CS)
Helps robots learn faster when things change.