Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection
By: Chengcan Wu, Zeming Wei, Huanran Chen, and more
Potential Business Impact:
Removes harmful information from AI models while keeping useful knowledge intact.
While Large Language Models (LLMs) have demonstrated impressive performance across various domains and tasks, concerns about their safety are becoming increasingly severe. In particular, since models may store unsafe knowledge internally, machine unlearning has emerged as a representative paradigm for ensuring model safety. Existing approaches employ various training techniques, such as gradient ascent and negative preference optimization, to eliminate the influence of undesired data on target models. However, these methods merely suppress the activation of undesired data through parametric training without completely eradicating its informational traces within the model. This fundamental limitation makes effective continuous unlearning difficult to achieve and leaves these methods vulnerable to relearning attacks. To overcome these challenges, we propose a Metamorphosis Representation Projection (MRP) approach that pioneers the application of irreversible projection properties to machine unlearning. By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks, achieving state-of-the-art unlearning effectiveness while preserving natural performance. Our code is available at https://github.com/ChengcanWu/MRP.
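The key idea the abstract appeals to is that a projection is irreversible: once a component of the hidden state is projected away, no linear map can recover it. As a minimal sketch of this property (not the authors' MRP implementation; the unwanted direction u, the dimension d_model, and the single rank-one projection are all illustrative assumptions, whereas in practice the subspace would be estimated from activations on undesired data), the following PyTorch snippet removes one "harmful" direction from a batch of hidden states:

```python
import torch

torch.manual_seed(0)
d_model = 16                         # hidden size (illustrative)

# Hypothetical unwanted direction; in practice estimated from
# activations on undesired data, here just a random unit vector.
u = torch.randn(d_model)
u = u / u.norm()

# P = I - u u^T projects onto the orthogonal complement of u.
P = torch.eye(d_model) - torch.outer(u, u)

h = torch.randn(2, d_model)          # a batch of hidden states
h_clean = h @ P                      # P is symmetric, so h @ P == h @ P.T

# The component along u is removed; everything orthogonal survives.
print((h_clean @ u).abs().max())             # ~0: unwanted direction gone
print(torch.allclose(P @ P, P, atol=1e-6))   # idempotent: applying P twice changes nothing
```

Because P is idempotent and rank-deficient, it has no inverse, which is the sense in which such a projection erases information rather than merely suppressing it.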
Similar Papers
Unlearning Imperative: Securing Trustworthy and Responsible LLMs through Engineered Forgetting
Machine Learning (CS)
Lets AI forget private information when asked.
A Survey on Unlearning in Large Language Models
Computation and Language
Lets AI forget private or harmful information.
MPRU: Modular Projection-Redistribution Unlearning as Output Filter for Classification Pipelines
Machine Learning (CS)
Easily removes unwanted information from AI models.