Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models
By: Shuzhou Yuan, Ercong Nie, Mario Tawfelis, and more
Potential Business Impact:
Makes AI less biased when judging hate speech.
Hate speech detection is a socially sensitive and inherently subjective task, with judgments often varying based on personal traits. While prior work has examined how socio-demographic factors influence annotation, the impact of personality traits on Large Language Models (LLMs) remains largely unexplored. In this paper, we present the first comprehensive study on the role of persona prompts in hate speech classification, focusing on MBTI-based traits. A human annotation survey confirms that MBTI dimensions significantly affect labeling behavior. Extending this to LLMs, we prompt four open-source models with MBTI personas and evaluate their outputs across three hate speech datasets. Our analysis uncovers substantial persona-driven variation, including inconsistencies with ground truth, inter-persona disagreement, and logit-level biases. These findings highlight the need to carefully define persona prompts in LLM-based annotation workflows, with implications for fairness and alignment with human values.
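The abstract describes prompting open-source LLMs with MBTI personas and then comparing both their labels and their logit-level scores across hate speech datasets. The sketch below illustrates what one such persona-prompted classification pass could look like; it is not the authors' code, and the model name, persona wording, and Yes/No label tokens are assumptions made for illustration.

```python
# Minimal sketch (not the paper's released code): prompt a causal LLM with an
# MBTI persona and read the next-token logits for Yes/No hate speech labels.
# Model name, persona phrasing, and label tokens are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed open-source model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def classify_with_persona(text: str, mbti: str) -> dict:
    """Return the logits the model assigns to 'Yes'/'No' under an MBTI persona."""
    prompt = (
        f"You are a person with the MBTI personality type {mbti}.\n"
        "Does the following text contain hate speech? Answer Yes or No.\n"
        f"Text: {text}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[-1]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[-1]
    return {"Yes": logits[yes_id].item(), "No": logits[no_id].item()}

# Comparing personas on the same post surfaces persona-driven variation
# of the kind the abstract reports (label flips, logit-level shifts).
for persona in ["INTJ", "ESFP"]:
    print(persona, classify_with_persona("Example post to be labeled.", persona))
```

Running the same text under different personas and comparing the Yes/No logit gap is one simple way to quantify the inter-persona disagreement and logit-level bias the study measures.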
Similar Papers
Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
Computation and Language
Makes AI better at spotting hate speech fairly.
The Impact of Annotator Personas on LLM Behavior Across the Perspectivism Spectrum
Computation and Language
Helps computers judge online hate speech fairly.
Evaluating Prompt-Driven Chinese Large Language Models: The Influence of Persona Assignment on Stereotypes and Safeguards
Computers and Society
Shows persona prompts can make AI repeat stereotypes about people.