Characterising Toxicity in Generative Large Language Models
By: Zhiyao Zhang, Yazan Mash'Al, Yuhan Wu
Potential Business Impact:
Shows when and why AI language models produce harmful outputs, helping builders prevent it.
In recent years, the advent of the attention mechanism has significantly advanced natural language processing (NLP), with transformer-based decoder-only architectures becoming ubiquitous due to their impressive text processing and generation capabilities. Despite these breakthroughs, large language models (LLMs) remain susceptible to generating undesired outputs: inappropriate, offensive, or otherwise harmful responses, which we collectively refer to as "toxic" outputs. Although methods such as reinforcement learning from human feedback (RLHF) have been developed to align model outputs with human values, these safeguards can often be circumvented through carefully crafted prompts. This paper therefore examines the extent to which LLMs generate toxic content when prompted, as well as the linguistic factors, both lexical and syntactic, that influence the production of such outputs in generative models.
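The abstract describes probing generative models with prompts and measuring how toxic the resulting text is. As a minimal sketch of that kind of measurement (not the authors' actual pipeline), the snippet below prompts an off-the-shelf GPT-2 model via Hugging Face transformers and scores each continuation with the open-source Detoxify classifier; the model choice and example prompts are illustrative assumptions.

```python
# Minimal sketch of prompt-based toxicity measurement; NOT the authors'
# pipeline. Assumes the open-source `transformers` and `detoxify` packages;
# GPT-2 and the example prompts are illustrative placeholders.
from transformers import pipeline
from detoxify import Detoxify

generator = pipeline("text-generation", model="gpt2")
scorer = Detoxify("original")  # multi-label toxicity classifier

prompts = [
    "The new neighbours are honestly",
    "People who disagree with me are",
]

for prompt in prompts:
    # Sample a short continuation for the prompt.
    generated = generator(prompt, max_new_tokens=30, do_sample=True)[0]["generated_text"]
    continuation = generated[len(prompt):]

    # Score the continuation; Detoxify returns probabilities for labels
    # such as 'toxicity', 'insult', and 'threat'.
    scores = scorer.predict(continuation)
    print(f"{prompt!r} -> toxicity={scores['toxicity']:.3f}, insult={scores['insult']:.3f}")
```

A fuller study along the lines of the abstract would aggregate such scores over many prompts and relate them to lexical and syntactic properties of the prompts.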
Similar Papers
How Toxic Can You Get? Search-based Toxicity Testing for Large Language Models
Software Engineering
Uses automated search to find prompts that make AI produce toxic text.
<think> So let's replace this phrase with insult... </think> Lessons learned from generation of toxic texts with LLMs
Computation and Language
Reports lessons from generating toxic texts with LLMs, finding that models struggle to learn detoxification from them.
Guardians and Offenders: A Survey on Harmful Content Generation and Safety Mitigation
Computation and Language
Surveys how harmful content is generated and how to make AI safer.