Reward-Preserving Attacks For Robust Reinforcement Learning
By: Lucas Schott, Elies Gherbi, Hatem Hajri, and more
Potential Business Impact:
Makes robots learn safely even when tricked.
Adversarial robustness in RL is difficult because perturbations affect entire trajectories: strong attacks can break learning, weak attacks yield little robustness, and the appropriate strength varies by state. We propose $\alpha$-reward-preserving attacks, which adapt the adversary's strength so that an $\alpha$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, we use a gradient-based attack direction and learn a state-dependent magnitude $\eta \le \eta_{\mathcal{B}}$ selected via a critic $Q^{\pi}_{\alpha}((s,a),\eta)$ trained off-policy over diverse radii. This adaptive tuning calibrates attack strength and, with intermediate $\alpha$, improves robustness across radii while preserving nominal performance, outperforming fixed- and random-radius baselines.
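The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of one plausible reading of the abstract: a critic $Q^{\pi}_{\alpha}((s,a),\eta)$ scores candidate radii, the largest radius that still leaves an $\alpha$ fraction of the nominal-to-worst-case return gap is selected, and the observation is then perturbed along a gradient-based direction with that radius. The names `QAlphaCritic`, `reward_preserving_attack`, the discretized radius grid, and the surrogate attack loss are all assumptions for illustration only.

```python
# Hypothetical sketch of an alpha-reward-preserving attack step (PyTorch).
# Assumes a deterministic continuous-control policy and a pre-trained critic.
import torch
import torch.nn as nn


class QAlphaCritic(nn.Module):
    """Illustrative critic Q^pi_alpha((s, a), eta): predicts the return
    reachable from state s, action a, under an attack of radius eta."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, obs, act, eta):
        return self.net(torch.cat([obs, act, eta], dim=-1))


def reward_preserving_attack(policy, q_alpha, obs, eta_budget, alpha, n_radii=8):
    """Pick the largest radius whose predicted return keeps an alpha fraction
    of the nominal-to-worst-case gap, then perturb obs with an FGSM-style step."""
    radii = torch.linspace(0.0, eta_budget, n_radii)
    with torch.no_grad():
        act = policy(obs)
        # Predicted return for each candidate radius at this state.
        q_vals = torch.stack([
            q_alpha(obs, act, torch.full((1,), r.item())) for r in radii
        ]).squeeze(-1)
        q_nominal, q_worst = q_vals[0], q_vals.min()
        # Target return: keep an alpha fraction of the nominal-to-worst gap.
        target = q_worst + alpha * (q_nominal - q_worst)
        # Largest radius whose predicted return still meets the target.
        feasible = (q_vals >= target).nonzero().squeeze(-1)
        eta = radii[feasible.max()] if feasible.numel() > 0 else radii[0]

    # Gradient-based attack direction on the observation (placeholder
    # surrogate loss; the paper's actual objective may differ).
    obs_adv = obs.clone().requires_grad_(True)
    loss = policy(obs_adv).pow(2).sum()
    grad, = torch.autograd.grad(loss, obs_adv)
    with torch.no_grad():
        obs_adv = obs + eta * grad.sign()
    return obs_adv, eta
```

In this reading, $\alpha = 1$ forces the nominal radius (weakest attack) and $\alpha = 0$ allows the full budget $\eta_{\mathcal{B}}$, so intermediate values of $\alpha$ interpolate between clean training and worst-case attacks on a per-state basis.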
Similar Papers
Policy Disruption in Reinforcement Learning: Adversarial Attack with Large Language Models and Critical State Identification
Machine Learning (CS)
Tricks AI into making bad choices.
Provably Invincible Adversarial Attacks on Reinforcement Learning Systems: A Rate-Distortion Information-Theoretic Approach
Machine Learning (CS)
Makes AI systems impossible to trick.
Robust Deep Reinforcement Learning in Robotics via Adaptive Gradient-Masked Adversarial Attacks
Machine Learning (CS)
Tricks robots into making bad choices.