UIPE: Enhancing LLM Unlearning by Removing Knowledge Related to Forgetting Targets
By: Wenyu Wang, Mengqi Zhang, Xiaotian Ye, and more
Potential Business Impact:
Removes harmful knowledge from AI models without breaking them.
Large Language Models (LLMs) inevitably acquire harmful information during training on massive datasets. LLM unlearning aims to eliminate the influence of such harmful information while preserving the model's overall performance. Existing unlearning methods, exemplified by gradient-ascent-based approaches, focus primarily on forgetting the target data while overlooking the crucial impact of logically related knowledge on unlearning effectiveness. In this paper, through both theoretical and experimental analyses, we first demonstrate that a key reason for suboptimal unlearning performance is that models can reconstruct the target content by reasoning over logically related knowledge. To address this issue, we propose Unlearning Improvement via Parameter Extrapolation (UIPE), a method that removes knowledge highly correlated with the forgetting targets. Experimental results show that UIPE significantly enhances the performance of various mainstream LLM unlearning methods on the TOFU benchmark.
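To make the two ingredients named in the abstract concrete, here is a minimal sketch of gradient-ascent unlearning followed by a parameter-extrapolation step. This is an illustrative reading, not the paper's implementation: it assumes a HuggingFace-style PyTorch model whose forward pass returns a `.loss`, and the names `forget_loader` and `alpha`, along with the update rule theta <- theta_u + alpha * (theta_u - theta_orig), are assumptions introduced here for the sketch.

```python
import copy
import torch

def unlearn_with_extrapolation(model, forget_loader, lr=1e-5, alpha=0.5, steps=100):
    """Gradient-ascent unlearning plus parameter extrapolation (hypothetical sketch).

    Assumes a HuggingFace-style model: model(**batch) returns an object
    with a .loss computed against labels already present in the batch.
    """
    # Snapshot the pre-unlearning weights so we can extrapolate later.
    original = copy.deepcopy(model.state_dict())
    opt = torch.optim.SGD(model.parameters(), lr=lr)

    it = iter(forget_loader)
    for _ in range(steps):
        try:
            batch = next(it)
        except StopIteration:
            it = iter(forget_loader)  # restart the loader when exhausted
            batch = next(it)
        loss = model(**batch).loss    # LM loss on the forget targets
        opt.zero_grad()
        (-loss).backward()            # gradient *ascent*: maximize loss on forget data
        opt.step()

    # Extrapolate past the unlearned weights along the unlearning direction,
    # one plausible reading of "parameter extrapolation": pushing further in
    # this direction is meant to also suppress correlated knowledge.
    with torch.no_grad():
        unlearned = model.state_dict()
        for name, p in unlearned.items():
            if p.is_floating_point():  # skip integer buffers
                p.add_(alpha * (p - original[name]))
    return model
```

The paper's actual update rule may weight parameters by their correlation with the forgetting targets rather than extrapolating uniformly, as this sketch does; the code only conveys the shape of the idea.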
Similar Papers
Investigating Model Editing for Unlearning in Large Language Models
Computation and Language
Removes bad info from AI without breaking it.
Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs
Machine Learning (CS)
Removes bad info from AI, making it safer.
iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss
Computation and Language
Teaches computers to forget bad information.