Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models
By: Anindya Sundar Das, Kangjie Chen, Monowar Bhuyan
Potential Business Impact:
Detects hidden triggers that make text-classifying AI misbehave.
Pre-trained language models have achieved remarkable success across a wide range of natural language processing (NLP) tasks, particularly when fine-tuned on large, domain-relevant datasets. However, they remain vulnerable to backdoor attacks, where adversaries embed malicious behaviors using trigger patterns in the training data. These triggers remain dormant during normal usage but, when activated, cause targeted misclassifications. In this work, we investigate the internal behavior of backdoored pre-trained encoder-based language models, focusing on a consistent shift in attention and gradient attribution when processing poisoned inputs: the trigger token dominates both attention and gradient signals, overriding the surrounding context. We propose an inference-time defense that constructs anomaly scores by combining token-level attention and gradient information. Extensive experiments on text classification tasks across diverse backdoor attack scenarios demonstrate that our method significantly reduces attack success rates compared to existing baselines. Furthermore, we provide an interpretability-driven analysis of the scoring mechanism, shedding light on trigger localization and the robustness of the proposed defense.
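A minimal sketch of the idea described in the abstract, assuming a HuggingFace/PyTorch encoder-based classifier: per-token gradient attributions and the attention each token receives are normalized and combined into a per-token anomaly score, and a sharp spike at a single token flags a potential trigger. The model name, the product-based combination, and the example trigger word "cf" are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: per-token anomaly scores from attention + gradient attribution.
# Assumes a fine-tuned (possibly backdoored) encoder classifier; the scoring
# formula below is an illustrative choice, not the authors' exact method.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder for any fine-tuned encoder classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def token_anomaly_scores(text: str):
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    # Work on input embeddings so we can take gradients w.r.t. each token.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    out = model(inputs_embeds=embeds,
                attention_mask=enc["attention_mask"],
                output_attentions=True)
    pred = out.logits.argmax(dim=-1).item()
    # Gradient attribution: gradient of the predicted-class logit w.r.t. embeddings.
    out.logits[0, pred].backward()
    grad_attr = embeds.grad.norm(dim=-1).squeeze(0)        # [seq_len]
    # Attention received by each token, averaged over layers, heads, and queries.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))     # [1, seq_len, seq_len]
    attn_received = attn.squeeze(0).mean(dim=0)              # [seq_len]
    # Normalize each signal and combine (simple product, an assumption).
    g = grad_attr / (grad_attr.sum() + 1e-12)
    a = attn_received / (attn_received.sum() + 1e-12)
    score = g * a
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, score.tolist()))

# A single token dominating both signals (one large spike in the combined
# score) is treated as a candidate backdoor trigger at inference time.
scores = token_anomaly_scores("the movie was cf great and moving")
print(max(scores, key=lambda t: t[1]))
```

In a deployed defense, an input whose maximum token score exceeds a calibrated threshold could be rejected or have the suspect token masked before re-classification; the threshold and response policy are left open here.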
Similar Papers
Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
Cryptography and Security
Finds hidden "bad instructions" in AI.
Illuminating the Black Box: Real-Time Monitoring of Backdoor Unlearning in CNNs via Explainable AI
Cryptography and Security
Cleans computer brains of hidden bad instructions.
Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
Computation and Language
Finds hidden "bad instructions" in AI.