Exchange Policy Optimization Algorithm for Semi-Infinite Safe Reinforcement Learning
By: Jiaming Zhang, Yujie Yang, Haoning Wang, and more
Potential Business Impact:
Keeps robots safe while they learn new tasks.
Safe reinforcement learning (safe RL) aims to respect safety requirements while optimizing long-term performance. In many practical applications, however, the problem involves an infinite number of constraints, known as semi-infinite safe RL (SI-safe RL). Such constraints typically appear when safety conditions must be enforced across an entire continuous parameter space, such as ensuring adequate resource distribution at every spatial location. In this paper, we propose exchange policy optimization (EPO), an algorithmic framework that achieves optimal policy performance and deterministic bounded safety. EPO works by iteratively solving safe RL subproblems with finite constraint sets and adaptively adjusting the active set through constraint expansion and deletion. At each iteration, constraints whose violations exceed a predefined tolerance are added to refine the policy, while those with zero Lagrange multipliers are removed after the policy update. This exchange rule prevents uncontrolled growth of the working set and supports effective policy training. Our theoretical analysis demonstrates that, under mild assumptions, policies trained via EPO achieve performance comparable to that of optimal solutions, while the global constraint violation remains strictly within a prescribed bound.
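To make the exchange rule concrete, below is a minimal sketch in Python. It replaces the safe RL task with a hypothetical toy problem: the "policy" is a vector x in R^2, the reward is -||x - target||^2, and the semi-infinite constraint requires x·[cos θ, sin θ] ≤ 1 for every angle θ (i.e., x must stay in the unit disk). The names (`solve_subproblem`, `epo`, the primal-dual update, the candidate grid) are illustrative assumptions, not the authors' implementation; only the expand/delete logic mirrors the rule described in the abstract.

```python
import numpy as np

# Toy stand-in for a safe RL task: "policy" x in R^2, reward r(x) = -||x - target||^2,
# and a semi-infinite constraint x . u(theta) <= 1 for every theta in [0, 2*pi).
target = np.array([2.0, 0.0])

def grad_objective(x):
    # Gradient of the negated reward, -r(x) = ||x - target||^2.
    return 2.0 * (x - target)

def constraint(x, theta):
    # g(x, theta) = x . [cos theta, sin theta] - 1, required to stay <= 0.
    return x @ np.array([np.cos(theta), np.sin(theta)]) - 1.0

def grad_constraint(theta):
    return np.array([np.cos(theta), np.sin(theta)])

def solve_subproblem(x, thetas, steps=2000, lr=1e-2):
    # Primal-dual gradient method for the finite-constraint subproblem;
    # a stand-in for the safe RL policy update EPO performs on the working set.
    lam = np.zeros(len(thetas))
    for _ in range(steps):
        g = np.array([constraint(x, t) for t in thetas])
        grad_x = grad_objective(x) + sum(l * grad_constraint(t) for l, t in zip(lam, thetas))
        x = x - lr * grad_x
        lam = np.maximum(0.0, lam + lr * g)  # dual ascent, projected to lam >= 0
    return x, lam

def epo(x, tolerance=1e-3, n_outer=20, n_candidates=64):
    candidates = np.linspace(0.0, 2.0 * np.pi, n_candidates, endpoint=False)
    working_set = []
    for _ in range(n_outer):
        # Expansion: add candidate constraints violated beyond the tolerance.
        for t in candidates:
            if constraint(x, t) > tolerance and t not in working_set:
                working_set.append(t)
        x, lam = solve_subproblem(x, working_set)
        # Deletion: drop constraints whose Lagrange multipliers are (numerically) zero.
        working_set = [t for t, l in zip(working_set, lam) if l > 1e-8]
    return x, working_set

x_final, active = epo(np.zeros(2))
print("final x:", x_final)                 # expected near [1, 0], the constrained optimum
print("active constraint angles:", active)  # expected to shrink to angles near 0
```

In this sketch the working set stays small because satisfied constraints receive zero multipliers and are discarded, which is the behavior the abstract attributes to the exchange rule; how EPO actually parameterizes policies and solves each subproblem is detailed in the paper itself.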
Similar Papers
Incentivizing Safer Actions in Policy Optimization for Constrained Reinforcement Learning
Machine Learning (CS)
Keeps robots safe while they learn tasks.
EEPO: Exploration-Enhanced Policy Optimization via Sample-Then-Forget
Computation and Language
Helps AI learn new things by forgetting and trying again.
Evolutionary Policy Optimization
Machine Learning (CS)
Teaches robots to learn faster and better.