ISOPO: Proximal policy gradients without π_old
By: Nilin Abrahamsen
Potential Business Impact:
Teaches robots to learn faster with less effort.
This note introduces Isometric Policy Optimization (ISOPO), an efficient method to approximate the natural policy gradient in a single gradient step. In comparison, existing proximal policy methods such as GRPO or CISPO use multiple gradient steps with variants of importance ratio clipping to approximate a natural gradient step relative to a reference policy. In its simplest form, ISOPO normalizes the log-probability gradient of each sequence in the Fisher metric before contracting with the advantages. Another variant of ISOPO transforms the microbatch advantages based on the neural tangent kernel in each layer. ISOPO applies this transformation layer-wise in a single backward pass and can be implemented with negligible computational overhead compared to vanilla REINFORCE.
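To make the "simplest form" concrete, below is a minimal sketch of a single ISOPO-style update as described in the abstract: each sequence's log-probability gradient is normalized before being weighted by its advantage and accumulated into one policy-gradient step. This is not the authors' implementation; the `policy.log_prob` interface, the field names, and the use of the per-sequence Euclidean gradient norm as a stand-in for the Fisher-metric normalization are illustrative assumptions.

```python
import torch

def isopo_step(policy, optimizer, sequences, advantages):
    """One update: per-sequence normalized log-prob gradients, contracted with advantages.

    Assumptions (not from the paper's code): `policy.log_prob(seq)` returns the
    summed log-probability of a sequence, and the Euclidean norm is used as a
    proxy for the Fisher-metric normalization described in the abstract.
    """
    params = [p for p in policy.parameters() if p.requires_grad]
    accumulated = [torch.zeros_like(p) for p in params]

    for seq, adv in zip(sequences, advantages):
        logp = policy.log_prob(seq)                     # assumed interface
        grads = torch.autograd.grad(logp, params)
        # Normalize this sequence's gradient before weighting by its advantage.
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).clamp_min(1e-8)
        for acc, g in zip(accumulated, grads):
            acc.add_(g, alpha=float(adv) / float(norm))

    optimizer.zero_grad()
    for p, acc in zip(params, accumulated):
        p.grad = -acc / len(sequences)                  # ascend the averaged, normalized gradient
    optimizer.step()
```

Because the normalization only rescales each per-sequence gradient, the loop above costs essentially the same as vanilla REINFORCE; the layer-wise, NTK-based variant mentioned in the abstract would replace the scalar normalization with a per-layer transformation of the microbatch advantages, which this sketch does not attempt.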
Similar Papers
Think Outside the Policy: In-Context Steered Policy Optimization
Machine Learning (CS)
Teaches computers to solve math problems better.
Reparameterization Proximal Policy Optimization
Machine Learning (CS)
Teaches robots to learn faster and more reliably.
Soft Adaptive Policy Optimization
Machine Learning (CS)
Teaches AI to learn better and faster.