Reward-Preserving Attacks For Robust Reinforcement Learning
By: Lucas Schott, Elies Gherbi, Hatem Hajri, and more
Potential Business Impact:
Makes robots learn safely even when tricked.
Adversarial robustness in RL is difficult because perturbations affect entire trajectories: strong attacks can break learning, weak attacks yield little robustness, and the appropriate strength varies by state. We propose $\alpha$-reward-preserving attacks, which adapt the adversary's strength so that an $\alpha$ fraction of the nominal-to-worst-case return gap remains achievable at each state. In deep RL, we use a gradient-based attack direction and learn a state-dependent magnitude $\eta \le \eta_{\mathcal{B}}$ selected via a critic $Q^{\pi}_{\alpha}((s,a),\eta)$ trained off-policy over diverse radii. This adaptive tuning calibrates attack strength and, with intermediate $\alpha$, improves robustness across radii while preserving nominal performance, outperforming fixed- and random-radius baselines.
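The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of one plausible reading of the abstract: a critic $Q^{\pi}_{\alpha}((s,a),\eta)$ scores candidate radii, the largest radius that still leaves an $\alpha$ fraction of the nominal-to-worst-case return gap is selected, and the observation is then perturbed along a gradient-based direction with that radius. The names `QAlphaCritic`, `reward_preserving_attack`, the discretized radius grid, and the surrogate attack loss are all assumptions for illustration only.

```python
# Hypothetical sketch of an alpha-reward-preserving attack step (PyTorch).
# Assumes a deterministic continuous-control policy and a pre-trained critic.
import torch
import torch.nn as nn


class QAlphaCritic(nn.Module):
    """Illustrative critic Q^pi_alpha((s, a), eta): predicts the return
    reachable from state s, action a, under an attack of radius eta."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, obs, act, eta):
        return self.net(torch.cat([obs, act, eta], dim=-1))


def reward_preserving_attack(policy, q_alpha, obs, eta_budget, alpha, n_radii=8):
    """Pick the largest radius whose predicted return keeps an alpha fraction
    of the nominal-to-worst-case gap, then perturb obs with an FGSM-style step."""
    radii = torch.linspace(0.0, eta_budget, n_radii)
    with torch.no_grad():
        act = policy(obs)
        # Predicted return for each candidate radius at this state.
        q_vals = torch.stack([
            q_alpha(obs, act, torch.full((1,), r.item())) for r in radii
        ]).squeeze(-1)
        q_nominal, q_worst = q_vals[0], q_vals.min()
        # Target return: keep an alpha fraction of the nominal-to-worst gap.
        target = q_worst + alpha * (q_nominal - q_worst)
        # Largest radius whose predicted return still meets the target.
        feasible = (q_vals >= target).nonzero().squeeze(-1)
        eta = radii[feasible.max()] if feasible.numel() > 0 else radii[0]

    # Gradient-based attack direction on the observation (placeholder
    # surrogate loss; the paper's actual objective may differ).
    obs_adv = obs.clone().requires_grad_(True)
    loss = policy(obs_adv).pow(2).sum()
    grad, = torch.autograd.grad(loss, obs_adv)
    with torch.no_grad():
        obs_adv = obs + eta * grad.sign()
    return obs_adv, eta
```

In this reading, $\alpha = 1$ forces the nominal radius (weakest attack) and $\alpha = 0$ allows the full budget $\eta_{\mathcal{B}}$, so intermediate values of $\alpha$ interpolate between clean training and worst-case attacks on a per-state basis.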
Similar Papers
Policy Disruption in Reinforcement Learning: Adversarial Attack with Large Language Models and Critical State Identification
Machine Learning (CS)
Tricks AI into making bad choices.
Provably Invincible Adversarial Attacks on Reinforcement Learning Systems: A Rate-Distortion Information-Theoretic Approach
Machine Learning (CS)
Makes AI systems impossible to trick.
Robust Deep Reinforcement Learning in Robotics via Adaptive Gradient-Masked Adversarial Attacks
Machine Learning (CS)
Tricks robots into making bad choices.