Are generative AI text annotations systematically biased?
By: Sjoerd B. Stolwijk, Mark Boukes, Damian Trilling
This paper investigates bias in generative large language model (GLLM) annotations by conceptually replicating the manual annotations of Boukes (2024). We use four GLLMs (Llama3.1:8b, Llama3.3:70b, GPT4o, and Qwen2.5:72b) in combination with five different prompts for five concepts (political content, interactivity, rationality, incivility, and ideology). We find that the GLLMs perform adequately in terms of F1 scores, but they differ from the manual annotations in prevalence, yield substantively different downstream results, and display systematic bias: they overlap more with each other than with the manual annotations. Differences in F1 scores fail to account for the degree of bias.
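The bias diagnosis sketched in the abstract can be made concrete with a small, self-contained example. This is not the authors' pipeline: the labels below are synthetic, and the 15% shared-error and 5% model-specific-noise rates are hypothetical assumptions chosen only so that the pattern described above (models overlapping more with each other than with the manual annotations) appears in the output. The sketch computes per-model F1 and prevalence against the manual labels, then compares mean model-model and model-manual agreement using Cohen's kappa.

```python
# Minimal sketch (hypothetical data, not the authors' code) of the checks
# described in the abstract: per-model F1, prevalence gaps, and a systematic-
# bias test via pairwise agreement.
import numpy as np
from itertools import combinations
from sklearn.metrics import f1_score, cohen_kappa_score

rng = np.random.default_rng(seed=0)
n = 500
manual = rng.integers(0, 2, size=n)   # hypothetical manual annotations (0/1)

# Hypothetical GLLM annotations: a shared systematic error (15% of cases in
# which every model flips the manual label) plus small model-specific noise.
shared_bias = rng.random(n) < 0.15
models = {}
for name in ["Llama3.1:8b", "Llama3.3:70b", "GPT4o", "Qwen2.5:72b"]:
    biased = np.where(shared_bias, 1 - manual, manual)
    own_noise = rng.random(n) < 0.05
    models[name] = np.where(own_noise, 1 - biased, biased)

# Per-model validity (F1) and prevalence, compared with the manual baseline.
for name, labels in models.items():
    print(f"{name}: F1={f1_score(manual, labels):.2f}, "
          f"prevalence={labels.mean():.2f} (manual {manual.mean():.2f})")

# Systematic-bias check: if models agree more with each other than with the
# manual annotations, their errors are shared rather than random.
kappa_model_model = [cohen_kappa_score(models[a], models[b])
                     for a, b in combinations(models, 2)]
kappa_model_manual = [cohen_kappa_score(manual, labels)
                      for labels in models.values()]
print(f"mean model-model kappa:  {np.mean(kappa_model_model):.2f}")
print(f"mean model-manual kappa: {np.mean(kappa_model_manual):.2f}")
```

Under these assumptions, the shared error drags down model-manual kappa but leaves model-model kappa intact, which is the signature of systematic rather than random disagreement; similar F1 scores, by contrast, say nothing about whether the residual errors are shared.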