Safety Alignment of LMs via Non-cooperative Games

Published: December 23, 2025 | arXiv ID: 2512.20806v1

By: Anselm Paulus, Ilia Kulikov, Brandon Amos and more

Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges into a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
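To make the contrast between pairwise and point-wise supervision concrete, here is a minimal sketch of one way to turn pairwise comparisons into a per-response reward. It is not taken from the paper and should not be read as AdvGame's actual reward. The names `pairwise_win_rates` and `make_toy_judge` are hypothetical, and the toy judge, a logistic function of made-up quality scores, only stands in for whatever preference model a real setup would use.

```python
import math
from typing import Callable, Dict, List, Sequence


def pairwise_win_rates(
    responses: Sequence[str],
    prefer: Callable[[str, str], float],
) -> List[float]:
    """Reward each response with its average pairwise win probability.

    `prefer(a, b)` is assumed to return the probability that `a` is
    preferred over `b`. No absolute (point-wise) score is ever used.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two responses to compare")
    wins = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            p = prefer(responses[i], responses[j])
            wins[i] += p          # credit for i beating j
            wins[j] += 1.0 - p    # complementary credit for j
    # Each response takes part in n - 1 comparisons.
    return [w / (n - 1) for w in wins]


def make_toy_judge(quality: Dict[str, float]) -> Callable[[str, str], float]:
    """Hypothetical judge: logistic preference on made-up quality scores."""
    def prefer(a: str, b: str) -> float:
        return 1.0 / (1.0 + math.exp(quality[b] - quality[a]))
    return prefer


if __name__ == "__main__":
    candidates = ["a", "b", "c"]
    judge = make_toy_judge({"a": 0.0, "b": 1.0, "c": 2.0})
    for text, reward in zip(candidates, pairwise_win_rates(candidates, judge)):
        print(f"{text}: {reward:.3f}")
```

Because each reward is a relative win rate, adding a constant offset to every quality score in the toy judge leaves the rewards unchanged. That invariance to absolute scale is one commonly cited reason comparison-based signals can be harder to exploit than raw scores.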

Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models

Machine Learning (CS)

AI learns to defend itself from bad questions.

9 Jun 2025 2

92%

Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers

Machine Learning (CS)

Makes AI safer without retraining it.

10 Oct 2025 0

91%

Agent Safety Alignment via Reinforcement Learning

Artificial Intelligence

Keeps AI safe when it uses outside tools.

11 Jul 2025 0
