Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation
By: Sampriti Soor, Suklav Ghosh, Arijit Sur
Potential Business Impact:
Shows that a short snippet of text, appended to any prompt, can reliably fool AI models.
Language models (LMs) are often used as zero-shot or few-shot classifiers by scoring label words, but they remain fragile to adversarial prompts. Prior work typically optimizes task- or model-specific triggers, making results difficult to compare and limiting transferability. We study universal adversarial suffixes: short token sequences (4-10 tokens) that, when appended to any input, broadly reduce accuracy across tasks and models. Our approach learns the suffix in a differentiable "soft" form using Gumbel-Softmax relaxation and then discretizes it for inference. Training maximizes calibrated cross-entropy on the label region while masking gold tokens to prevent trivial leakage, with entropy regularization to avoid collapse. A single suffix trained on one model transfers effectively to others, consistently lowering both accuracy and calibrated confidence. Experiments on sentiment analysis, natural language inference, paraphrase detection, commonsense QA, and physical reasoning with Qwen2-1.5B, Phi-1.5, and TinyLlama-1.1B demonstrate consistent attack effectiveness and transfer across tasks and model families.
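To make the method concrete, here is a minimal PyTorch sketch of the kind of training loop the abstract describes: learnable per-position logits over the vocabulary, a Gumbel-Softmax "soft" suffix mixed from embedding rows, a cross-entropy objective restricted to the label region, an entropy regularizer, and argmax discretization at inference. This is not the authors' code; the hyperparameters (`suffix_len`, `tau`, `lambda_ent`, the learning rate), the `step` helper, and the exact placement of the gold-token mask (here, masking label words out of the suffix vocabulary to prevent trivial leakage) are illustrative assumptions.

```python
# Sketch of universal adversarial suffix learning via Gumbel-Softmax relaxation.
# Assumes a Hugging Face causal LM; all hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B"   # one of the models evaluated in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():     # only the suffix is optimized
    p.requires_grad_(False)
embed = model.get_input_embeddings()

suffix_len, tau, lambda_ent = 8, 0.5, 0.01   # assumed values (paper uses 4-10 tokens)
suffix_logits = torch.nn.Parameter(torch.zeros(suffix_len, embed.num_embeddings))
opt = torch.optim.Adam([suffix_logits], lr=1e-2)

def step(prompt_ids, label_token_ids, gold_idx):
    """One attack step. prompt_ids: (1, T) token ids; label_token_ids: ids of
    the candidate label words; gold_idx: position of the gold label among them."""
    opt.zero_grad()

    # Mask label words out of the suffix vocabulary so the suffix cannot
    # trivially leak a label token (one reading of the paper's gold masking).
    masked = suffix_logits.clone()
    masked[:, label_token_ids] = -1e9

    # Gumbel-Softmax: a differentiable, nearly one-hot token choice per
    # position; mixing embedding rows gives a "soft" suffix to backprop through.
    probs = F.gumbel_softmax(masked, tau=tau, hard=False)    # (L, V)
    suffix_embeds = (probs @ embed.weight).unsqueeze(0)      # (1, L, d)

    inputs = torch.cat([embed(prompt_ids), suffix_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits[:, -1, :]    # next-token logits

    # Cross-entropy restricted to the label region; maximizing it pushes the
    # model's calibrated label distribution away from the gold label.
    label_logits = logits[:, label_token_ids]
    ce = F.cross_entropy(label_logits, torch.tensor([gold_idx]))

    # Entropy regularizer keeps per-position distributions from collapsing.
    p = F.softmax(suffix_logits, dim=-1)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(-1).mean()

    loss = -ce - lambda_ent * entropy   # minimize -> maximize CE and entropy
    loss.backward()
    opt.step()

# At inference, discretize the relaxation: argmax token at each position.
suffix_ids = suffix_logits.argmax(dim=-1)
```

A common variant here is `hard=True` in `F.gumbel_softmax`, which forwards a true one-hot sample with a straight-through gradient; whether the paper uses the soft or straight-through form is not stated in the abstract.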
Similar Papers
Universal Adversarial Suffixes for Language Models Using Reinforcement Learning with Calibrated Reward
Computation and Language
Learns short text suffixes that reliably trick AI models.
Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
Machine Learning (CS)
Tricks large AI models with carefully optimized text.
Toward Understanding the Transferability of Adversarial Suffixes in Large Language Models
Computation and Language
Studies why text that tricks one AI model often tricks others too.