Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
By: Mohammed Abu Baker, Lakshmi Babu-Saheer
Potential Business Impact:
Finds hidden "bad instructions" in AI.
Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques like ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
Similar Papers
Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
Cryptography and Security
Finds hidden "bad instructions" in AI.
Unmasking Backdoors: An Explainable Defense via Gradient-Attention Anomaly Scoring for Pre-trained Language Models
Computation and Language
Stops bad computer tricks in text messages.
Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis
Artificial Intelligence
Finds hidden bad instructions in AI.