
PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm

Published: January 13, 2026 | arXiv ID: 2601.08951v1

By: Jing-Jing Li, Joel Mire, Eve Fleisig and more

BigTech Affiliations: University of California, Berkeley

Potential Business Impact:

Helps AI understand when people disagree about harm.

Business Areas:

Artificial Intelligence, Data and Analytics, Science and Engineering, Software

Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that relate to imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.
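The two axes described in the abstract can be illustrated with a small sketch. This is not the paper's method: the rating scale (0 = benign, 1 = harmful), the use of the mean as the harm score, the use of the population standard deviation as the disagreement score, and the names `prompt_axes` and `toy` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def prompt_axes(ratings):
    """Place each prompt on a harm axis and an agreement axis.

    ratings: dict mapping a prompt id to a list of numeric harm ratings,
    one per annotator. The 0-to-1 scale is an assumption, not the
    benchmark's actual scale.
    """
    axes = {}
    for prompt_id, scores in ratings.items():
        axes[prompt_id] = {
            # Harm axis: average perceived harmfulness (assumed measure).
            "harm": mean(scores),
            # Agreement axis: spread of ratings; 0 means full agreement
            # (assumed measure).
            "disagreement": pstdev(scores),
        }
    return axes


# Hypothetical toy data: three prompts rated by four annotators each.
toy = {
    "p1": [0, 0, 0, 0],  # everyone says benign
    "p2": [1, 1, 1, 1],  # everyone says harmful
    "p3": [0, 1, 0, 1],  # annotators split evenly
}
axes = prompt_axes(toy)
for prompt_id, scores in axes.items():
    print(prompt_id, scores)
```

Under these assumptions, `p1` and `p2` sit at opposite ends of the harm axis with no disagreement, while `p3` lands in the middle of the harm axis with the highest disagreement, which is the kind of borderline case the benchmark targets.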

Echoes of AI Harms: A Human-LLM Synergistic Framework for Bias-Driven Harm Anticipation

Computers and Society

Finds AI problems before they hurt people.

27 Nov 2025 0

89%

AI Harmonics: a human-centric and harms severity-adaptive AI risk assessment framework

Artificial Intelligence

Helps stop AI from causing harm.

12 Sep 2025 1

88%

Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior

Artificial Intelligence

Makes AI understand different people better.

18 Nov 2025 3


Country of Origin

🇺🇸 United States

Page Count

36 pages
