Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
By: Leander Girrbach, Stephan Alaniz, Genevieve Smith, and more
Potential Business Impact:
Finds unfairness in AI from its training pictures.
Vision-language models trained on large-scale multimodal datasets show strong demographic biases, but the role of training data in producing these biases remains unclear. A major barrier has been the lack of demographic annotations in web-scale datasets such as LAION-400M. We address this gap by creating person-centric annotations for the full dataset, including over 276 million bounding boxes, perceived gender and race/ethnicity labels, and automatically generated captions. These annotations are produced through validated automatic labeling pipelines combining object detection, multimodal captioning, and finetuned classifiers. Using them, we uncover demographic imbalances and harmful associations, such as the disproportionate linking of men and individuals perceived as Black or Middle Eastern with crime-related and negative content. We also show that 60-70% of gender bias in CLIP and Stable Diffusion can be linearly explained by direct co-occurrences in the data. Our resources establish the first large-scale empirical link between dataset composition and downstream model bias.
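The "linearly explained" finding can be read as a regression of per-concept model bias on per-concept co-occurrence statistics measured in the annotated data. The sketch below is illustrative only and not the authors' code: the inputs are hypothetical numbers, and the exact bias and co-occurrence metrics used in the paper are assumptions here.

```python
# Illustrative sketch (not the paper's implementation): regress per-concept
# model bias scores on dataset co-occurrence statistics to estimate how much
# of the bias is linearly explained by the training data.
import numpy as np

# Hypothetical inputs, one entry per concept (e.g. an occupation or attribute):
#   cooccurrence_gap[i] = difference in how often concept i co-occurs with
#                         male- vs. female-perceived people in the annotations
#   model_bias[i]       = measured gender skew of CLIP retrieval (or Stable
#                         Diffusion generations) for concept i
cooccurrence_gap = np.array([0.12, -0.05, 0.30, 0.01, -0.22, 0.18])
model_bias = np.array([0.10, -0.02, 0.25, 0.03, -0.15, 0.20])

# Ordinary least squares with an intercept term.
X = np.column_stack([np.ones_like(cooccurrence_gap), cooccurrence_gap])
coef, *_ = np.linalg.lstsq(X, model_bias, rcond=None)

# R^2: fraction of variance in model bias explained by the data statistic;
# the paper reports values in the 0.6-0.7 range for gender bias.
pred = X @ coef
ss_res = np.sum((model_bias - pred) ** 2)
ss_tot = np.sum((model_bias - model_bias.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot
print(f"slope={coef[1]:.2f}, R^2={r_squared:.2f}")
```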
Similar Papers
Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation
Computation and Language
Helps computers judge sexism fairly, not by who wrote it.
Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models
Machine Learning (CS)
Makes AI see fairer by changing its learning pictures.