UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation
By: Huimin Lu, Masaru Isonuma, Junichiro Mori, and more
Potential Business Impact:
Cleans up mean computer talk for any AI.
We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.
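The abstract describes distilling detoxifying text by contrastive decoding and then fine-tuning arbitrary LLMs on that text. Below is a minimal, hypothetical sketch of that idea: it assumes a base GPT-2 and a toxicity-amplified variant (the checkpoint name "gpt2-toxic-ft", the mixing weight alpha, and the exact scoring rule are illustrative assumptions, not the paper's recipe).

```python
# Hypothetical sketch: generate "detoxifying text" via contrastive decoding.
# Assumptions (not from the paper): checkpoint "gpt2-toxic-ft", alpha=0.5,
# and the specific score log p_base - alpha * log p_toxic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
# A GPT-2 variant fine-tuned on toxic text (hypothetical checkpoint name).
toxic = AutoModelForCausalLM.from_pretrained("gpt2-toxic-ft").to(device).eval()

@torch.no_grad()
def distill_detox_text(prompt: str, max_new_tokens: int = 128, alpha: float = 0.5) -> str:
    """Sample tokens by contrasting the base model against the toxic model:
    tokens the toxic model favors are down-weighted, so the sampled text
    encodes a detoxifying signal that other LLMs can be fine-tuned on."""
    ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    for _ in range(max_new_tokens):
        logp_base = torch.log_softmax(base(ids).logits[:, -1, :], dim=-1)
        logp_toxic = torch.log_softmax(toxic(ids).logits[:, -1, :], dim=-1)
        # Contrastive score: prefer tokens likely under the base model
        # but unlikely under the toxic model.
        scores = logp_base - alpha * logp_toxic
        next_id = torch.multinomial(torch.softmax(scores, dim=-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)

# The distilled text would then serve as ordinary fine-tuning data
# for any target LLM (e.g., OPT, Falcon, LLaMA-2).
print(distill_detox_text(tok.bos_token))
```

In this reading, the distilled corpus acts as model-agnostic fine-tuning data, which is what would let a single hyperparameter configuration transfer across different target models.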
Similar Papers
Detoxification of Large Language Models through Output-layer Fusion with a Calibration Model
Computation and Language
Makes AI talk nicely without getting dumb.
ylmmcl at Multilingual Text Detoxification 2025: Lexicon-Guided Detoxification and Classifier-Gated Rewriting
Computation and Language
Cleans up bad words in many languages.
LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification
Computation and Language
Cleans up mean online words automatically.