Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
By: Harbin Hong, Sebastian Caldas, Liu Leqi
Potential Business Impact:
Tests whether AI language models copy people's survey choices.
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, we present a hypothesis-testing framework to quantify the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine, in a principled way, whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model used to simulate people's opinions in various public surveys and found that the model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) on contentious questions. This raises questions about the alignment of this language model with the tested populations and highlights the need for new practices in using LLMs for social science studies beyond naive simulations of human subjects.
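The abstract does not spell out the test itself, but the general idea of comparing LLM-simulated and human answer distributions over the same multiple-choice options can be illustrated with a standard hypothesis test. The sketch below is not the paper's exact procedure; it uses a chi-squared test of homogeneity on hypothetical answer counts, with the null hypothesis that the LLM and the human sub-population draw answers from the same distribution.

```python
# Minimal sketch (not the paper's exact method): test whether LLM-simulated
# answers and human survey answers to one multiple-choice question plausibly
# come from the same categorical distribution. Counts below are placeholders.
import numpy as np
from scipy.stats import chi2_contingency

# Answer counts over the same options (e.g., A/B/C/D)
human_counts = np.array([120, 80, 60, 40])  # hypothetical human survey responses
llm_counts = np.array([150, 90, 40, 20])    # hypothetical LLM-simulated responses

# 2 x K contingency table: rows = population (human vs. LLM), columns = options
table = np.vstack([human_counts, llm_counts])
chi2, p_value, dof, _ = chi2_contingency(table)

alpha = 0.05
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
if p_value < alpha:
    print("Reject H0: the LLM's answer distribution differs from the humans'.")
else:
    print("Fail to reject H0: no misalignment detected at this significance level.")
```

In practice such a test would be run per question and per sub-population (e.g., by race, age, or income), with appropriate corrections for multiple comparisons.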
Similar Papers
Evaluating and Aligning Human Economic Risk Preferences in LLMs
General Economics
Checks whether AI shares human attitudes toward financial risk.
Should you use LLMs to simulate opinions? Quality checks for early-stage deliberation
Computers and Society
Tests if AI opinions are trustworthy for surveys.
Can Finetuning LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?
Computation and Language
Trains AI on small human samples so it acts more like people in studies.