The Erasure Illusion: Stress-Testing the Generalization of LLM Forgetting Evaluation
By: Hengrui Jia, Taoran Li, Jonas Guan, and more
Machine unlearning aims to remove specific data influences from trained models, a capability essential for adhering to copyright laws and ensuring AI safety. Current unlearning metrics typically measure success by monitoring the model's performance degradation on the specific unlearning dataset ($D_u$). We argue that for Large Language Models (LLMs), this evaluation paradigm is insufficient and potentially misleading. Many real-world uses of unlearning, whether motivated by copyright or safety, implicitly target not only verbatim content in $D_u$, but also behaviors influenced by the broader generalizations the model derived from it. We demonstrate that LLMs can pass standard unlearning evaluations and appear to have "forgotten" the target knowledge, while simultaneously retaining strong capabilities on content that is semantically adjacent to $D_u$. This phenomenon indicates that erasing exact sentences does not necessarily equate to removing the underlying knowledge. To address this gap, we propose Proximal Surrogate Generation (PSG), an automated stress-testing framework that generates a surrogate dataset, $\tilde{D}_u$. This surrogate set is constructed to be semantically derived from $D_u$ yet sufficiently distinct from it in embedding space. By comparing unlearning metric scores between $D_u$ and $\tilde{D}_u$, we can stress-test the reliability of the metric itself. Our extensive evaluation across three LLM families (Llama-3-8B, Qwen2.5-7B, and Zephyr-7B-$\beta$), three distinct datasets, and seven standard metrics reveals widespread inconsistencies. We find that current metrics frequently overestimate unlearning success, failing to detect retained knowledge exposed by our stress-test datasets.
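The abstract describes the PSG workflow only at a high level: derive surrogate examples from $D_u$, keep those that are far enough from the originals in embedding space, then compare the unlearning metric's score on $D_u$ versus $\tilde{D}_u$. The following is a minimal Python sketch of that comparison loop under stated assumptions, not the paper's implementation: `generate_paraphrase`, `embed`, and `unlearning_metric` are hypothetical placeholders for a rewriting model, a sentence encoder, and any one of the seven evaluated metrics, and the cosine-distance threshold is illustrative rather than taken from the paper.

```python
# Minimal sketch of a PSG-style stress test (assumptions noted above).
# All callables are hypothetical placeholders; only the control flow is
# meant to mirror the procedure described in the abstract.

from typing import Callable, Dict, List, Sequence
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two embedding vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return 1.0 - float(np.dot(a, b) / denom)


def build_surrogate_set(
    D_u: Sequence[str],
    generate_paraphrase: Callable[[str], str],   # hypothetical rewriting model
    embed: Callable[[str], np.ndarray],          # hypothetical sentence encoder
    min_distance: float = 0.3,                   # illustrative, not from the paper
    max_attempts: int = 5,
) -> List[str]:
    """Construct ~D_u: examples semantically derived from D_u but kept
    sufficiently far from each original example in embedding space."""
    surrogates: List[str] = []
    for x in D_u:
        x_emb = embed(x)
        for _ in range(max_attempts):
            candidate = generate_paraphrase(x)
            if cosine_distance(embed(candidate), x_emb) >= min_distance:
                surrogates.append(candidate)
                break
    return surrogates


def stress_test_metric(
    D_u: Sequence[str],
    D_u_tilde: Sequence[str],
    unlearning_metric: Callable[[Sequence[str]], float],  # any standard metric
) -> Dict[str, float]:
    """Score the same metric on D_u and on the surrogate set ~D_u."""
    score_orig = unlearning_metric(D_u)
    score_surrogate = unlearning_metric(D_u_tilde)
    return {
        "score_on_D_u": score_orig,
        "score_on_D_u_tilde": score_surrogate,
        "gap": score_surrogate - score_orig,
    }
```

Under this reading, a metric that reports successful forgetting on $D_u$ while the model still scores well on $\tilde{D}_u$ is detecting verbatim erasure rather than removal of the underlying knowledge, which is the inconsistency the stress test is designed to expose.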