Rethinking Toxicity Evaluation in Large Language Models: A Multi-Label Perspective
By: Zhiqiang Kou, Junyang Chen, Xin-Qiang Cai, and more
Potential Business Impact:
Detects harmful text in AI-generated writing more accurately.
Large language models (LLMs) have achieved impressive results across a range of natural language processing tasks, but their potential to generate harmful content has raised serious safety concerns. Current toxicity detectors rely primarily on single-label benchmarks, which cannot adequately capture the inherently ambiguous and multi-dimensional nature of real-world toxic prompts. This limitation leads to biased evaluations, including missed detections of toxic content and false positives, undermining the reliability of existing detectors. Moreover, gathering comprehensive multi-label annotations across fine-grained toxicity categories is prohibitively costly, further hindering effective evaluation and development. To tackle these issues, we introduce three novel multi-label benchmarks for toxicity detection: Q-A-MLL, R-A-MLL, and H-X-MLL, derived from public toxicity datasets and annotated according to a detailed 15-category taxonomy. We further provide a theoretical proof that, on our released datasets, training with pseudo-labels yields better performance than directly learning from single-label supervision. In addition, we develop a pseudo-label-based toxicity detection method. Extensive experimental results show that our approach significantly surpasses advanced baselines, including GPT-4o and DeepSeek, thus enabling more accurate and reliable evaluation of multi-label toxicity in LLM-generated content.
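To make the pseudo-label idea concrete, below is a minimal PyTorch sketch of one plausible setup, not the authors' released method: a single gold toxicity label is expanded into soft multi-label targets using a teacher model's per-category scores, and a 15-way multi-label head is trained with an independent sigmoid per category. All names here (ToxicityHead, make_pseudo_labels, the embedding dimension, and the 0.5 threshold) are illustrative assumptions.

```python
# Hypothetical sketch of pseudo-label multi-label toxicity training.
# Not the paper's released code; all names and hyperparameters are assumptions.
import torch
import torch.nn as nn

NUM_CATEGORIES = 15  # fine-grained toxicity taxonomy size from the paper


class ToxicityHead(nn.Module):
    """Multi-label classifier head on top of precomputed text embeddings."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.fc = nn.Linear(embed_dim, NUM_CATEGORIES)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        return self.fc(embeddings)  # raw logits, one per toxicity category


def make_pseudo_labels(teacher_logits: torch.Tensor,
                       single_label: torch.Tensor,
                       threshold: float = 0.5) -> torch.Tensor:
    """Expand a single gold label into soft multi-label targets.

    Categories the teacher scores above `threshold` are kept as soft
    targets; the original single-label annotation is always set to 1.
    """
    probs = torch.sigmoid(teacher_logits)
    targets = torch.where(probs > threshold, probs, torch.zeros_like(probs))
    targets.scatter_(1, single_label.unsqueeze(1), 1.0)  # keep the gold label
    return targets


# --- toy training step with stand-in tensors ---
head = ToxicityHead()
loss_fn = nn.BCEWithLogitsLoss()  # independent sigmoid per category
opt = torch.optim.AdamW(head.parameters(), lr=1e-4)

emb = torch.randn(8, 768)                      # stand-in text embeddings
gold = torch.randint(0, NUM_CATEGORIES, (8,))  # single-label annotations
teacher = torch.randn(8, NUM_CATEGORIES)       # stand-in teacher logits

targets = make_pseudo_labels(teacher, gold)
loss = loss_fn(head(emb), targets)
loss.backward()
opt.step()
```

The key design choice this sketch illustrates is treating the 15 categories as independent binary decisions (BCE with per-category sigmoids) rather than a softmax over mutually exclusive classes, which is what lets one ambiguous prompt carry several toxicity labels at once.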
Similar Papers
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation of LLM
Computation and Language
Makes AI safer and less likely to say bad things.
Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages
Computation and Language
Makes AI safer for different languages.