Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
By: Syed Naveed Mahmood, Md. Rezaur Rahman Bhuiyan, Tasfia Zaman and more
Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. We observe that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
Similar Papers
Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models
Computation and Language
Makes AI forget bad or wrong information.
Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?
Machine Learning (CS)
Removes unwanted info from AI, making it safer.
The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation
Cryptography and Security
Makes AI forget unwanted information better.