MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking
By: Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, and more
Potential Business Impact:
Finds AI bias against certain groups.
Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating "Identity Hate" into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model based solely on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not instill a general principle of non-discrimination but instead reinforce memorized refusal boundaries for specific groups, challenging current assumptions about safety scaling. We release all datasets and scripts on GitHub to encourage research into granular demographic alignment.
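To make the core measurement concrete, the sketch below shows how per-group defense rates and their spread within a model could be computed instead of a single scalar safety score. It is a minimal illustration assuming a hypothetical JSONL results file with illustrative field names (model, target_group, defended); it is not the authors' released evaluation code.

```python
# Minimal sketch: disaggregate refusal outcomes by target group rather than
# averaging them into one scalar score. All file and field names are
# hypothetical placeholders, not from the MiJaBench release.
from collections import defaultdict
import json

def per_group_defense_rates(records):
    """Map each (model, group) pair to its defense (refusal) rate."""
    counts = defaultdict(lambda: [0, 0])  # (model, group) -> [defended, total]
    for r in records:
        key = (r["model"], r["target_group"])
        counts[key][0] += 1 if r["defended"] else 0
        counts[key][1] += 1
    return {k: defended / total for k, (defended, total) in counts.items()}

def disparity_gap(rates, model):
    """Spread between the best- and worst-protected group for one model."""
    group_rates = [v for (m, _), v in rates.items() if m == model]
    return max(group_rates) - min(group_rates)

if __name__ == "__main__":
    with open("mijabench_align_results.jsonl") as f:  # hypothetical path
        records = [json.loads(line) for line in f]
    rates = per_group_defense_rates(records)
    for model in {m for m, _ in rates}:
        print(f"{model}: defense-rate gap across groups = {disparity_gap(rates, model):.2%}")
```

Under this kind of analysis, a gap such as the reported 33% within a single model becomes visible, whereas a single aggregated score would hide it.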
Similar Papers
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge
Computation and Language
Tests AI for unfairness, making it safer.
JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Computation and Language
Tests AI for hidden dangers in Chinese.
Do Methods to Jailbreak and Defend LLMs Generalize Across Languages?
Computation and Language
Makes AI safer in all languages.