
Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills

Published: June 15, 2025 | arXiv ID: 2506.12963v1

By: Changsheng Wang, Chongyu Fan, Yihua Zhang and more

Potential Business Impact:

Cleans harmful thoughts from smart computer brains.

Business Areas:

Machine Learning, Artificial Intelligence, Data and Analytics, Software

Recent advances in large reasoning models (LRMs) have enabled strong chain-of-thought (CoT) generation through test-time computation. While these multi-step reasoning capabilities represent a major milestone in language model performance, they also introduce new safety risks. In this work, we present the first systematic study to revisit the problem of machine unlearning in the context of LRMs. Machine unlearning refers to the process of removing the influence of sensitive, harmful, or undesired data or knowledge from a trained model without full retraining. We show that conventional unlearning algorithms, originally designed for non-reasoning models, are inadequate for LRMs. In particular, even when final answers are successfully erased, sensitive information often persists within the intermediate reasoning steps, i.e., CoT trajectories. To address this challenge, we extend conventional unlearning and propose Reasoning-aware Representation Misdirection for Unlearning ($R^2MU$), a novel method that effectively suppresses sensitive reasoning traces and prevents the generation of associated final answers, while preserving the model's reasoning ability. Our experiments demonstrate that $R^2MU$ significantly reduces sensitive information leakage within reasoning traces and achieves strong performance across both safety and reasoning benchmarks, evaluated on state-of-the-art models such as DeepSeek-R1-Distill-LLaMA-8B and DeepSeek-R1-Distill-Qwen-14B.
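The abstract's central observation is that an unlearned model can withhold the final answer while the sensitive content still shows up in the CoT trajectory. The sketch below illustrates how such a leak could be checked on a single generation. It is not the paper's evaluation protocol: it assumes the reasoning trace is wrapped in `<think>...</think>` tags (the convention used by DeepSeek-R1-style models), and the helper names, the substring-matching rule, and the sample output are all hypothetical.

```python
import re

# Assumption: the reasoning trace is delimited by <think>...</think>,
# as in DeepSeek-R1-style outputs. Everything after it is the final answer.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_trace_and_answer(output: str) -> tuple[str, str]:
    """Split one generation into (reasoning trace, final answer)."""
    match = THINK_RE.search(output)
    if match is None:
        # No trace found: treat the whole output as the final answer.
        return "", output.strip()
    trace = match.group(1).strip()
    answer = output[match.end():].strip()
    return trace, answer


def leak_report(output: str, sensitive_terms: list[str]) -> dict[str, list[str]]:
    """List which sensitive terms appear in the trace and in the answer.

    Hypothetical check: case-insensitive substring matching stands in
    for whatever leakage metric a real evaluation would use.
    """
    trace, answer = split_trace_and_answer(output)

    def hits(text: str) -> list[str]:
        lowered = text.lower()
        return [term for term in sensitive_terms if term.lower() in lowered]

    return {"trace_leaks": hits(trace), "answer_leaks": hits(answer)}


# Made-up example: the answer is a refusal, but the trace names the term.
sample = (
    "<think>The agent in question is compound X, which ...</think>\n"
    "I can't help with that."
)
report = leak_report(sample, ["compound X"])
print(report)
```

In this made-up sample the answer is clean while the trace still mentions the sensitive term, which is the failure mode the paper attributes to conventional unlearning when applied to LRMs.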

Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

Machine Learning (CS)

Finds hidden clues when computers forget things.

16 Jun 2025 1

90%

Keeping an Eye on LLM Unlearning: The Hidden Risk and Remedy

Cryptography and Security

Makes AI forget bad things without breaking good things.

31 May 2025 1

90%

Unlearning That Lasts: Utility-Preserving, Robust, and Almost Irreversible Forgetting in LLMs

Machine Learning (CS)

Removes bad info from AI, making it safer.

2 Sep 2025 1


Repos / Data Links

github.com

Page Count

16 pages
