Safety Alignment of LMs via Non-cooperative Games
By: Anselm Paulus, Ilia Kulikov, Brandon Amos, and more
Ensuring the safety of language models (LMs) while maintaining their usefulness remains a critical challenge in AI alignment. Current approaches rely on sequential adversarial training: generating adversarial prompts and then fine-tuning LMs to defend against them. We introduce a different paradigm: framing safety alignment as a non-zero-sum game between an Attacker LM and a Defender LM trained jointly via online reinforcement learning. Each LM continuously adapts to the other's evolving strategies, driving iterative improvement. Our method uses a preference-based reward signal derived from pairwise comparisons instead of point-wise scores, providing more robust supervision and potentially reducing reward hacking. Our RL recipe, AdvGame, shifts the Pareto frontier of safety and utility, yielding a Defender LM that is simultaneously more helpful and more resilient to adversarial attacks. In addition, the resulting Attacker LM converges to a strong, general-purpose red-teaming agent that can be directly deployed to probe arbitrary target models.
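To make the structure of the game concrete, below is a minimal sketch of a joint attacker/defender training loop of the kind the abstract describes. Everything in it is an illustrative assumption: the class and function names (ToyPolicy, pairwise_preference, advgame_round), the reward shaping, and the toy coin-flip judge are stand-ins, not the AdvGame implementation; the sketch only shows how two policies could be updated from pairwise-comparison rewards in an online loop.

```python
"""Illustrative sketch only: toy policies and a random preference judge
stand in for the actual LMs, reward model, and RL update rule."""
import random


class ToyPolicy:
    """Stand-in for an LM policy trained with online RL (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.total_reward = 0.0  # crude proxy for accumulated reward

    def generate(self, prompt):
        # A real implementation would sample from the LM; here we return a tag.
        return f"{self.name}-response-to[{prompt}]-{random.randint(0, 9)}"

    def update(self, trajectory, reward):
        # Placeholder for a policy-gradient / RLHF-style update step.
        self.total_reward += reward


def pairwise_preference(response_a, response_b):
    """Hypothetical preference judge: +1 if A is preferred, -1 if B.

    The abstract says rewards come from pairwise comparisons rather than
    point-wise scores; a real judge would be a preference/reward model.
    """
    return 1 if random.random() < 0.5 else -1


def advgame_round(attacker, defender, seed_prompt):
    # 1. Attacker rewrites the seed prompt into an adversarial prompt.
    adv_prompt = attacker.generate(seed_prompt)

    # 2. Defender answers both the adversarial prompt and the original prompt,
    #    so the judge can compare the two responses pairwise.
    adv_response = defender.generate(adv_prompt)
    ref_response = defender.generate(seed_prompt)

    # 3. Preference-based reward: which response is safer and more helpful?
    pref = pairwise_preference(adv_response, ref_response)

    # 4. Assumed non-zero-sum payoffs: the defender is rewarded when its
    #    response to the adversarial prompt is still preferred; the attacker
    #    is rewarded when it degrades the defender's response.
    defender.update((adv_prompt, adv_response), reward=1.0 if pref > 0 else -1.0)
    attacker.update((seed_prompt, adv_prompt), reward=1.0 if pref < 0 else 0.0)


if __name__ == "__main__":
    attacker, defender = ToyPolicy("attacker"), ToyPolicy("defender")
    seeds = ["how do I pick a lock?", "summarize this article", "write a poem"]
    for _ in range(100):  # online RL: both policies adapt every round
        advgame_round(attacker, defender, random.choice(seeds))
    print("attacker reward:", attacker.total_reward,
          "defender reward:", defender.total_reward)
```

In this sketch the two update calls encode the non-zero-sum aspect: both players can gain over time because the defender is scored on safety-plus-helpfulness rather than on simply beating the attacker.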
Similar Papers
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
Machine Learning (CS)
AI learns to defend itself from bad questions.
Safety Game: Balancing Safe and Informative Conversations with Blackbox Agentic AI using LP Solvers
Machine Learning (CS)
Makes AI safer without retraining it.
Agent Safety Alignment via Reinforcement Learning
Artificial Intelligence
Keeps AI safe when it uses outside tools.