ToxiGAN: Toxic Data Augmentation via LLM-Guided Directional Adversarial Generation
By: Peiran Li, Jan Fillies, Adrian Paschke
Potential Business Impact:
Makes AI better at spotting toxic and hateful language online.
Augmenting toxic language data in a controllable and class-specific manner is crucial for improving robustness in toxicity classification, yet remains challenging due to limited supervision and distributional skew. We propose ToxiGAN, a class-aware text augmentation framework that combines adversarial generation with semantic guidance from large language models (LLMs). To address common issues in GAN-based augmentation such as mode collapse and semantic drift, ToxiGAN introduces a two-step directional training strategy and leverages LLM-generated neutral texts as semantic ballast. Unlike prior work that treats LLMs as static generators, our approach dynamically selects neutral exemplars to provide balanced guidance. Toxic samples are explicitly optimized to diverge from these exemplars, reinforcing class-specific contrastive signals. Experiments on four hate speech benchmarks show that ToxiGAN achieves the strongest average performance in both macro-F1 and hate-F1, consistently outperforming traditional and LLM-based augmentation methods. Ablation and sensitivity analyses further confirm the benefits of semantic ballast and directional training in enhancing classifier robustness.
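The abstract describes toxic samples being optimized to diverge from LLM-generated neutral exemplars ("semantic ballast") while staying anchored to the toxic class. The paper does not spell out the objective here, so the following is only a minimal, hypothetical sketch of such a directional, ballast-guided loss; all names (directional_loss, the margin parameter, the embedding shapes) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a directional, contrastive objective in the spirit of the
# abstract: pull generated samples toward real toxic exemplars, push them away from
# LLM-provided neutral exemplars. Not the ToxiGAN codebase.
import torch
import torch.nn.functional as F

def directional_loss(gen_emb, toxic_emb, neutral_emb, margin=0.3):
    """Attraction to the toxic class, repulsion from the neutral 'ballast'."""
    # Attraction: generated embeddings should be close to real toxic embeddings.
    attract = 1.0 - F.cosine_similarity(gen_emb, toxic_emb, dim=-1)
    # Repulsion: similarity to neutral exemplars is penalized beyond a margin,
    # encouraging generated toxic text to diverge from the neutral exemplars.
    repel = F.relu(F.cosine_similarity(gen_emb, neutral_emb, dim=-1) - margin)
    return (attract + repel).mean()

# Toy usage with random sentence embeddings (batch of 8, 768-dim).
gen = torch.randn(8, 768)   # generator output embeddings
tox = torch.randn(8, 768)   # real toxic exemplar embeddings
neu = torch.randn(8, 768)   # dynamically selected neutral exemplars (ballast)
print(directional_loss(gen, tox, neu))
```

In this reading, the margin keeps the repulsion term from dominating once generated samples are already far from the neutral exemplars, which is one plausible way to limit the semantic drift the abstract mentions.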
Similar Papers
LLM-based Semantic Augmentation for Harmful Content Detection
Computation and Language
Cleans internet text to fight bad posts.
Towards Inclusive Toxic Content Moderation: Addressing Vulnerabilities to Adversarial Attacks in Toxicity Classifiers Tackling LLM-generated Content
Computation and Language
Makes online hate speech detectors harder to trick.
SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
Computation and Language
Teaches computers to spot bad online words.