PPO in the Fisher-Rao geometry

Published: June 4, 2025 | arXiv ID: 2506.03757v1

By: Razvan-Andrei Lascu, David Šiška, Łukasz Szpruch

Potential Business Impact:

Makes reinforcement learning training more reliable by adding formal guarantees of policy improvement and convergence.

Business Areas:
Peer to Peer Collaboration

Proximal Policy Optimization (PPO) has become a widely adopted algorithm for reinforcement learning, offering a practical policy gradient method with strong empirical performance. Despite its popularity, PPO lacks formal theoretical guarantees for policy improvement and convergence. PPO is motivated by Trust Region Policy Optimization (TRPO), which optimizes a surrogate loss with a KL-divergence penalty arising from linearizing the value function in a flat geometric space. In this paper, we derive a tighter surrogate in the Fisher-Rao (FR) geometry, yielding a novel variant, Fisher-Rao PPO (FR-PPO). The proposed scheme provides strong theoretical guarantees, including monotonic policy improvement. Furthermore, in the tabular setting, we show that FR-PPO achieves sub-linear convergence without any dependence on the dimensionality of the action or state spaces, marking a significant step toward establishing formal convergence results for PPO-based algorithms.
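
To make the contrast concrete, below is a minimal, illustrative Python sketch of a penalized surrogate for a tabular (categorical) policy at a single state, comparing the KL penalty that motivates TRPO/PPO with a Fisher-Rao geodesic-distance penalty. The function names (`penalized_surrogate`, `fisher_rao_distance`), the penalty coefficient `beta`, and the specific form of the FR penalty are assumptions for illustration only; the tighter FR surrogate actually derived in the paper is not reproduced here.

```python
# Illustrative sketch only: a KL-penalized TRPO-style surrogate vs. a
# Fisher-Rao-penalized variant for categorical policies at a single state.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for categorical distributions."""
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between categorical distributions:
    2 * arccos of the Bhattacharyya coefficient sum_i sqrt(p_i q_i)."""
    bc = np.clip(np.sum(np.sqrt(p * q)), 0.0, 1.0)
    return float(2.0 * np.arccos(bc))

def penalized_surrogate(pi_new, pi_old, advantages, beta, penalty="kl"):
    """Importance-weighted advantage minus a proximity penalty on the policy step."""
    ratio = pi_new / np.clip(pi_old, 1e-12, None)
    gain = float(np.sum(pi_old * ratio * advantages))  # = E_{a ~ pi_new}[A(s, a)]
    if penalty == "kl":            # flat-geometry penalty motivating TRPO/PPO
        prox = kl_divergence(pi_new, pi_old)
    elif penalty == "fisher_rao":  # Fisher-Rao geometry (illustrative substitute)
        prox = fisher_rao_distance(pi_new, pi_old)
    else:
        raise ValueError(penalty)
    return gain - beta * prox

# Toy usage: two actions, near-uniform old policy, one candidate update.
pi_old = np.array([0.5, 0.5])
pi_new = np.array([0.7, 0.3])
adv = np.array([1.0, -1.0])
print(penalized_surrogate(pi_new, pi_old, adv, beta=0.5, penalty="kl"))
print(penalized_surrogate(pi_new, pi_old, adv, beta=0.5, penalty="fisher_rao"))
```

The sketch only illustrates the structural difference between the two penalties (flat KL vs. Fisher-Rao distance on the probability simplex); it does not capture the paper's tighter surrogate bound or its convergence analysis.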

Page Count
17 pages

Category
Computer Science:
Machine Learning (CS)