Assessing Representation Stability for Transformer Models
By: Bryan E. Tuck, Rakesh M. Verma
Potential Business Impact:
Detects manipulated text crafted to trick AI language models.
Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining. We introduce Representation Stability (RS), a model-agnostic detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. RS first ranks words using importance heuristics, then measures embedding sensitivity to masking the top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, RS achieves over 88% detection accuracy and performs competitively with existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we reveal that gradient-based ranking outperforms attention-based and random selection approaches, with identification quality correlating with detection performance for word-level attacks. RS also generalizes well to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.
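To make the pipeline concrete, below is a minimal sketch of the rank-mask-measure-detect idea described in the abstract. It is illustrative only: it assumes a BERT encoder, a gradient-norm token ranking as a stand-in for the paper's importance heuristics, cosine distance on the [CLS] embedding as the sensitivity measure, and a toy BiLSTM head; names such as rs_features and all hyperparameters are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of a Representation Stability (RS)-style pipeline.
# Assumptions (not from the paper): BERT encoder, gradient-norm token ranking,
# cosine distance on the [CLS] embedding as sensitivity, toy BiLSTM detector.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()


def sentence_embedding(text: str) -> torch.Tensor:
    """[CLS] embedding of the input text (no gradients needed here)."""
    with torch.no_grad():
        enc = tokenizer(text, return_tensors="pt", truncation=True)
        return encoder(**enc).last_hidden_state[:, 0, :]  # (1, hidden)


def rank_tokens_by_gradient(text: str) -> list:
    """Rank token positions by the L2 norm of the embedding gradient.
    A proxy objective (norm of the [CLS] vector) stands in for the victim
    classifier's loss, which this sketch does not have access to."""
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    embeds = encoder.embeddings.word_embeddings(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    out = encoder(inputs_embeds=embeds, attention_mask=enc["attention_mask"])
    out.last_hidden_state[:, 0, :].norm().backward()
    scores = embeds.grad.norm(dim=-1).squeeze(0)  # per-token importance
    # Subword-to-word aggregation is omitted for brevity; rank token
    # positions directly, skipping the [CLS]/[SEP] specials.
    order = scores[1:-1].argsort(descending=True) + 1
    return order.tolist()


def rs_features(text: str, k: int = 5) -> torch.Tensor:
    """Sensitivity sequence: cosine distance between the original embedding
    and the embedding with each of the top-k tokens masked in turn."""
    base = sentence_embedding(text)
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    feats = []
    for pos in rank_tokens_by_gradient(text)[:k]:
        masked_ids = enc["input_ids"].clone()
        masked_ids[0, pos] = tokenizer.mask_token_id
        with torch.no_grad():
            emb = encoder(input_ids=masked_ids,
                          attention_mask=enc["attention_mask"]
                          ).last_hidden_state[:, 0, :]
        feats.append(1 - torch.cosine_similarity(base, emb))
    return torch.stack(feats).view(1, -1, 1)  # (batch, k, feature_dim)


class RSDetector(nn.Module):
    """Toy BiLSTM head that classifies the sensitivity sequence."""

    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 2)  # clean vs. adversarial

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])


if __name__ == "__main__":
    feats = rs_features("The movie was surprisingly good despite the reviews.")
    logits = RSDetector()(feats)
    print(logits.softmax(dim=-1))  # untrained detector, illustrative output only
```

In this sketch the sensitivity values keep the ranking order, which is what would let a sequence model such as a BiLSTM pick up on the disproportionately high masking sensitivity of perturbed words that the abstract reports.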
Similar Papers
Towards Trustworthy Wi-Fi Sensing: Systematic Evaluation of Deep Learning Model Robustness to Adversarial Attacks
Machine Learning (CS)
Makes wireless sensing safer from hacking.
RobustMask: Certified Robustness against Adversarial Neural Ranking Attack via Randomized Masking
Cryptography and Security
Protects search results from fake information.
Test-Time Defense Against Adversarial Attacks via Stochastic Resonance of Latent Ensembles
Computer Vision and Pattern Recognition
Protects AI from being tricked by fake images.