Synthetic social data: trials and tribulations
By: Guido Ivetta, Laura Moradbakhti, Rafael A. Calvo
Potential Business Impact:
AI-generated survey text is less reliable than even small samples of real respondents for social studies.
Large Language Models are being used in conversational agents that simulate human conversations and generate social-studies data. While concerns about these models' biases have been raised and discussed in the literature, much about the data they generate remains unknown. In this study we explore the statistical representation of social values across four countries (UK, Argentina, USA and China) for six LLMs, with equal representation of open- and closed-weight models. By comparing machine-generated outputs with actual human survey data, we assess whether algorithmic biases in LLMs outweigh the biases inherent in real-world sampling, including demographic and response biases. Our findings suggest that, despite the logistical and financial constraints of human surveys, even a small, skewed sample of real respondents may provide more reliable insights than synthetic data produced by LLMs. These results highlight the limitations of using AI-generated text for social research and emphasize the continued importance of empirical human data collection.
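The abstract does not state which statistic the authors use to compare machine-generated outputs with human survey responses. As a purely illustrative sketch, one common approach is to turn each group's answers to a survey question into a frequency distribution and measure the gap between them, for example with the Jensen-Shannon divergence. All data and function names below are hypothetical, not taken from the paper:

```python
# Hypothetical sketch: comparing an LLM-generated answer distribution with a
# human survey distribution for one Likert-style question. The metric choice
# (Jensen-Shannon divergence) is an assumption, not the paper's method.
from collections import Counter
import math

def distribution(responses, categories):
    """Normalized frequency of each answer category."""
    counts = Counter(responses)
    total = len(responses)
    return [counts.get(c, 0) / total for c in categories]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions; 0..1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return (kl(p, m) + kl(q, m)) / 2

categories = [1, 2, 3, 4, 5]                 # e.g., a 5-point agreement scale
human = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]       # illustrative human sample
synthetic = [3, 3, 3, 3, 3, 3, 4, 4, 4, 4]   # illustrative LLM sample

gap = js_divergence(distribution(human, categories),
                    distribution(synthetic, categories))
print(f"JS divergence: {gap:.3f}")
```

A larger divergence would indicate that the synthetic respondents' answer distribution departs further from the human baseline; repeating this per question, per country, and per model is one plausible way to operationalize the comparison the abstract describes.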
Similar Papers
Social Simulations with Large Language Model Risk Utopian Illusion
Computation and Language
Simulated chat users come out unrealistically nice and fake.
Simulating Online Social Media Conversations on Controversial Topics Using AI Agents Calibrated on Real-World Data
Social and Information Networks
Computers can now convincingly pretend to be people in online discussions.
Emulating Public Opinion: A Proof-of-Concept of AI-Generated Synthetic Survey Responses for the Chilean Case
Computation and Language
Computers can answer survey questions much like real people do.