Learning-Time Encoding Shapes Unlearning in LLMs
By: Ruihan Wu, Konstantin Garov, Kamalika Chaudhuri
Potential Business Impact:
Teaches computers to forget bad or wrong information.
As large language models (LLMs) are increasingly deployed in the real world, the ability to "unlearn", or remove specific pieces of knowledge post hoc, has become essential for a variety of reasons ranging from privacy regulations to correcting outdated or harmful content. Prior work has proposed unlearning benchmarks and algorithms, and has typically assumed that the training process and the target model are fixed. In this work, we empirically investigate how learning-time choices in knowledge encoding impact the effectiveness of unlearning factual knowledge. Our experiments reveal two key findings: (1) learning with paraphrased descriptions improves unlearning performance and (2) unlearning an individual piece of knowledge from a chunk of text is challenging. Our results suggest that learning-time knowledge encoding may play a central role in enabling reliable post-hoc unlearning.
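To make the setup concrete, below is a minimal sketch (not the authors' code) of the contrast the abstract describes: encoding a fact at learning time through several paraphrased descriptions rather than a single surface form, then applying a standard post-hoc unlearning baseline (gradient ascent on the forget text). The model name, the example fact, the paraphrases, and the choice of unlearning objective are illustrative assumptions, not the paper's exact experimental setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM could stand in here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default

# Hypothetical fact to learn and later unlearn, plus paraphrased encodings.
fact = "Alice Rivera was born in Lisbon in 1987."
paraphrases = [
    fact,
    "Born in 1987, Alice Rivera is a native of Lisbon.",
    "Lisbon is the birthplace of Alice Rivera (b. 1987).",
]

def lm_loss(texts):
    """Average next-token prediction loss over a batch of texts."""
    batch = tok(texts, return_tensors="pt", padding=True)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # ignore padding positions
    return model(**batch, labels=labels).loss

opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Learning time: finding (1) suggests encoding the fact via paraphrased
# descriptions (rather than one fixed surface form) eases later unlearning.
for _ in range(3):
    opt.zero_grad()
    lm_loss(paraphrases).backward()
    opt.step()

# Unlearning time: a common post-hoc baseline is gradient *ascent*
# on the text to be forgotten.
for _ in range(3):
    opt.zero_grad()
    (-lm_loss([fact])).backward()  # maximize loss on the forgotten fact
    opt.step()
```

Under this kind of setup, unlearning success would typically be judged by whether the model can still answer probes about the forgotten fact (in any phrasing) while retaining unrelated knowledge.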
Similar Papers
A Survey on Unlearning in Large Language Models
Computation and Language
Lets AI forget private or bad information.
UCD: Unlearning in LLMs via Contrastive Decoding
Computation and Language
Removes bad info from AI without breaking it.
LLM Unlearning Should Be Form-Independent
Computation and Language
Removes bad ideas from AI, even if phrased differently.