Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
By: Sampriti Soor, Suklav Ghosh, Arijit Sur
Potential Business Impact:
Shows that AI models can be easily tricked by short text.
Language models are vulnerable to short adversarial suffixes that can reliably alter their predictions. Previous work typically finds such suffixes with gradient search or rule-based methods, which are brittle and often tied to a single task or model. This paper uses a reinforcement learning framework in which the suffix is treated as a policy and trained with Proximal Policy Optimization (PPO) against a frozen language model acting as a reward oracle. Rewards are shaped with calibrated cross-entropy, removing label bias and aggregating over label surface forms to improve transferability. The method is evaluated on five diverse NLP benchmarks covering sentiment, natural language inference, paraphrase, and commonsense reasoning, using three language models: Qwen2-1.5B Instruct, TinyLlama-1.1B Chat, and Phi-1.5. Results show that RL-trained suffixes consistently degrade accuracy and transfer across tasks and models more effectively than previously proposed adversarial triggers of the same kind.
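The abstract describes the reward only at a high level. The sketch below shows one plausible way to realize a calibrated, surface-form-aggregated cross-entropy reward against a frozen oracle using the Hugging Face transformers API; the content-free calibration prompt, the log-sum-exp aggregation, and names such as `calibrated_reward` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code): the frozen reward oracle is a
# Hugging Face causal LM, each label has several surface forms, and the suffix's
# reward is the oracle's calibrated cross-entropy on the gold label.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-1.5B-Instruct"  # one of the frozen models named in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
oracle = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def label_logprob(prompt: str, answer: str) -> float:
    """Sum of the oracle's log-probabilities for the answer tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(answer, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=-1)
    logprobs = torch.log_softmax(oracle(ids).logits[:, :-1, :], dim=-1)
    positions = range(prompt_ids.shape[-1] - 1, ids.shape[-1] - 1)
    return sum(logprobs[0, pos, ids[0, pos + 1]].item() for pos in positions)

@torch.no_grad()
def calibrated_reward(question: str, suffix: str, true_label: str,
                      surface_forms: dict[str, list[str]]) -> float:
    """Attack reward: cross-entropy of the gold label under the attacked prompt,
    with a content-free baseline subtracted so the oracle's prior preference for
    any label (label bias) cancels out."""
    attacked = f"{question} {suffix}\nAnswer:"
    content_free = f"N/A {suffix}\nAnswer:"  # calibration prompt (assumption)
    scores = {}
    for label, forms in surface_forms.items():
        # Aggregate (log-sum-exp) over surface forms of the same label.
        raw = torch.logsumexp(torch.tensor(
            [label_logprob(attacked, f" {f}") for f in forms]), dim=0)
        base = torch.logsumexp(torch.tensor(
            [label_logprob(content_free, f" {f}") for f in forms]), dim=0)
        scores[label] = (raw - base).item()  # calibrated label score
    probs = torch.softmax(torch.tensor(list(scores.values())), dim=0)
    true_idx = list(scores).index(true_label)
    # Higher cross-entropy on the gold label = more successful suffix.
    return -torch.log(probs[true_idx]).item()

# A PPO loop over suffix tokens would call this for each sampled suffix.
forms = {"positive": ["positive", "good"], "negative": ["negative", "bad"]}
print(calibrated_reward("The movie was great.", "zoning tapir !!", "positive", forms))
```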
Similar Papers
Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
Computation and Language
Shows that AI models are easily fooled by bad words.
Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Machine Learning (CS)
Stops smart computers from being tricked.
RIVAL: Reinforcement Learning with Iterative and Adversarial Optimization for Machine Translation
Computation and Language
Makes movie subtitles sound natural in any language.