ISOPO: Proximal policy gradients without pi-old

Published: December 29, 2025 | arXiv ID: 2512.23353v1

By: Nilin Abrahamsen

Potential Business Impact:

Trains AI systems, such as robots or language models, to learn faster with less computing effort.

Business Areas:
Indoor Positioning Navigation and Mapping

This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
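The following is a minimal sketch of the simplest variant described above, under stated assumptions: each sequence's log-probability gradient is normalized by its own Euclidean norm (used here as a single-sample stand-in for the Fisher norm) before being contracted with the advantages. The toy policy, `tokens`, and `advantages` are illustrative placeholders, not the paper's setup, and this is not the layer-wise NTK variant.

```python
# Hedged sketch of ISOPO's simplest form: normalize each sequence's
# log-probability gradient (here by its norm, an assumed Fisher estimate)
# before contracting with the advantages. Not the paper's exact code.
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab, hidden, seq_len, batch = 16, 32, 8, 4
policy = nn.Sequential(nn.Embedding(vocab, hidden), nn.Linear(hidden, vocab))

tokens = torch.randint(vocab, (batch, seq_len))   # sampled sequences (toy data)
advantages = torch.randn(batch)                   # e.g. group-normalized rewards

params = [p for p in policy.parameters() if p.requires_grad]
update = [torch.zeros_like(p) for p in params]

for i in range(batch):
    logits = policy(tokens[i])                    # (seq_len, vocab)
    logp = torch.log_softmax(logits, dim=-1)
    # log-probability of the sampled sequence (next-token targets shifted by one)
    seq_logp = logp[:-1].gather(1, tokens[i, 1:].unsqueeze(1)).sum()
    grads = torch.autograd.grad(seq_logp, params)
    # Fisher normalization, approximated here by the gradient's own norm
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-8
    for u, g in zip(update, grads):
        u += advantages[i] * g / norm

lr = 1e-2
with torch.no_grad():
    for p, u in zip(params, update):
        p += lr * u / batch                       # ascend the normalized policy gradient
```

Unlike GRPO-style methods, no reference policy or importance-ratio clipping appears: the normalization itself bounds each sequence's contribution, which is what makes a single gradient step per batch feasible.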

Page Count
9 pages

Category
Computer Science:
Machine Learning (CS)