PPO in the Fisher-Rao geometry
By: Razvan-Andrei Lascu, David Šiška, Łukasz Szpruch
Potential Business Impact:
Makes computer learning more reliable and faster.
Proximal Policy Optimization (PPO) has become a widely adopted algorithm for reinforcement learning, offering a practical policy gradient method with strong empirical performance. Despite its popularity, PPO lacks formal theoretical guarantees for policy improvement and convergence. PPO is motivated by Trust Region Policy Optimization (TRPO), which employs a surrogate loss with a KL divergence penalty that arises from linearizing the value function within a flat geometric space. In this paper, we derive a tighter surrogate in the Fisher-Rao (FR) geometry, yielding a novel variant, Fisher-Rao PPO (FR-PPO). Our proposed scheme provides strong theoretical guarantees, including monotonic policy improvement. Furthermore, in the tabular setting, we demonstrate that FR-PPO achieves sub-linear convergence without any dependence on the dimensionality of the action or state spaces, marking a significant step toward establishing formal convergence results for PPO-based algorithms.
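The abstract contrasts the standard KL-penalized surrogate that motivates PPO with a tighter surrogate derived in the Fisher-Rao geometry. The sketch below only illustrates that contrast in a tabular, single-state, discrete-action setting: it computes the familiar KL-penalized surrogate and, as a stand-in for the change of geometry, a penalty given by the closed-form Fisher-Rao distance between categorical distributions. The paper's actual FR-PPO surrogate is not reproduced in this abstract, so `penalized_surrogate`, `fisher_rao_distance`, and the penalty coefficient `beta` are hypothetical illustrations, not the authors' method.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance on the probability simplex:
    2 * arccos of the Bhattacharyya coefficient."""
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)
    return float(2.0 * np.arccos(bc))

def penalized_surrogate(pi_new, pi_old, advantages, beta, penalty="kl"):
    """Per-state penalized surrogate:
        E_{a ~ pi_old}[ (pi_new(a) / pi_old(a)) * A(a) ] - beta * D(pi_old, pi_new)
    with D either the KL divergence (the flat-geometry penalty that motivates PPO)
    or, purely for illustration, the Fisher-Rao distance."""
    ratio = pi_new / np.clip(pi_old, 1e-12, 1.0)
    gain = float(np.sum(pi_old * ratio * advantages))  # importance-weighted advantage
    if penalty == "kl":
        d = kl_divergence(pi_old, pi_new)
    elif penalty == "fisher_rao":
        d = fisher_rao_distance(pi_old, pi_new)
    else:
        raise ValueError(f"unknown penalty: {penalty}")
    return gain - beta * d

# Toy check with three actions in a single state.
pi_old = np.array([0.5, 0.3, 0.2])
pi_new = np.array([0.6, 0.25, 0.15])
adv = np.array([1.0, -0.5, 0.2])
print(penalized_surrogate(pi_new, pi_old, adv, beta=0.1, penalty="kl"))
print(penalized_surrogate(pi_new, pi_old, adv, beta=0.1, penalty="fisher_rao"))
```

The Fisher-Rao distance used here is the known closed form for categorical distributions, 2 arccos(Σ_a √(p_a q_a)), i.e. twice the Bhattacharyya angle, which is the geodesic distance induced by the Fisher information metric on the simplex; how the paper's tighter surrogate actually uses this geometry is developed in the full text.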
Similar Papers
Truncated Proximal Policy Optimization
Artificial Intelligence
Trains smart computer brains to solve problems faster.
Value-Free Policy Optimization via Reward Partitioning
Machine Learning (CS)
Teaches computers to learn from simple feedback.
Central Path Proximal Policy Optimization
Machine Learning (CS)
Teaches robots to follow rules without losing skill.