Central Path Proximal Policy Optimization
By: Nikola Milosevic, Johannes Müller, Nico Scherf
Potential Business Impact:
Teaches robots to follow rules without losing skill.
In constrained Markov decision processes, enforcing constraints during training is often thought to come at the cost of final return. Recently, it was shown that constraints can be incorporated directly into the policy geometry, yielding an optimization trajectory close to the central path of a barrier method without compromising final return. Building on this idea, we introduce Central Path Proximal Policy Optimization (C3PO), a simple modification of the PPO loss that produces policy iterates which stay close to the central path of the constrained optimization problem. Compared to existing on-policy methods, C3PO delivers improved performance with tighter constraint enforcement, suggesting that central path-guided updates are a promising direction for constrained policy optimization.
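To make the central-path idea concrete, below is a minimal sketch in Python (PyTorch) of a clipped PPO surrogate combined with a log-barrier on an estimated constraint cost. The function name, the barrier form, and the hyperparameters are illustrative assumptions for exposition, not the exact C3PO objective from the paper; the sketch only shows how a barrier term keeps iterates in the interior of the feasible set, roughly tracking a central path.

```python
import torch


def barrier_augmented_ppo_loss(logp_new, logp_old, adv_reward, cost_estimate,
                               cost_limit, clip_eps=0.2, barrier_coef=0.1):
    """Clipped PPO surrogate plus a log-barrier on the constraint slack.

    Hypothetical sketch: the term -barrier_coef * log(cost_limit - cost)
    grows as the policy approaches the constraint boundary, so gradient
    steps stay in the interior of the feasible set, mimicking a
    central-path trajectory. Not the paper's exact loss.
    """
    # Standard clipped PPO reward surrogate.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    reward_surrogate = torch.min(ratio * adv_reward, clipped * adv_reward).mean()

    # Log-barrier on the (estimated) expected constraint cost; the slack
    # must stay positive for the barrier to be defined, hence the clamp.
    slack = torch.clamp(cost_limit - cost_estimate, min=1e-6)
    barrier = -barrier_coef * torch.log(slack)

    # Return a loss to minimize: maximize reward while paying the barrier.
    return -(reward_surrogate - barrier)
```

In a barrier-method reading, annealing barrier_coef toward zero over training would let the iterates approach the constrained optimum along the central path; the schedule shown here is left out for brevity and would be an additional design choice.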
Similar Papers
PPO in the Fisher-Rao geometry
Machine Learning (CS)
Makes computer learning more reliable and faster.
Truncated Proximal Policy Optimization
Artificial Intelligence
Trains smart computer brains to solve problems faster.
Towards Causal Model-Based Policy Optimization
Machine Learning (CS)
Teaches computers to make better choices when things change.