Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Published: November 6, 2025 | arXiv ID: 2511.04195v2

By: Nicolò Pagan, Petter Törnberg, Christopher A. Bail, and more

Potential Business Impact:

Makes AI talk like people, but it's not quite there.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Large language models (LLMs) are increasingly used in the social sciences to simulate human behavior, based on the assumption that they can generate realistic, human-like text. Yet this assumption remains largely untested. Existing validation efforts rely heavily on human-judgment-based evaluations -- testing whether humans can distinguish AI from human output -- despite evidence that such judgments are blunt and unreliable. As a result, the field lacks robust tools for assessing the realism of LLM-generated text or for calibrating models to real-world data. This paper makes two contributions. First, we introduce a computational Turing test: a validation framework that integrates aggregate metrics (BERT-based detectability and semantic similarity) with interpretable linguistic features (stylistic markers and topical patterns) to assess how closely LLMs approximate human language within a given dataset. Second, we systematically compare nine open-weight LLMs across five calibration strategies -- including fine-tuning, stylistic prompting, and context retrieval -- benchmarking their ability to reproduce user interactions on X (formerly Twitter), Bluesky, and Reddit. Our findings challenge core assumptions in the literature. Even after calibration, LLM outputs remain clearly distinguishable from human text, particularly in affective tone and emotional expression. Instruction-tuned models underperform their base counterparts, and scaling up model size does not enhance human-likeness. Crucially, we identify a trade-off: optimizing for human-likeness often comes at the cost of semantic fidelity, and vice versa. These results provide a much-needed scalable framework for validation and calibration in LLM simulations -- and offer a cautionary note about their current limitations in capturing human communication.
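The two aggregate metrics named in the abstract, detectability and semantic similarity, can be illustrated with a minimal sketch. The abstract does not give their exact definitions, so everything here is an assumption: `detectability` is taken to be the accuracy of some human-vs-AI classifier (the paper uses a BERT-based one; here the predictions are simply passed in), and `cosine_similarity` uses bag-of-words counts as a crude stand-in for embedding-based similarity. Both function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between two texts.

    Assumption: a toy stand-in for embedding-based semantic similarity.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words of text_a; Counter returns 0 for missing words.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def detectability(true_labels: list, predicted_labels: list) -> float:
    """Accuracy of a human-vs-AI classifier on labeled texts.

    Assumption: with balanced classes, 0.5 means the classifier is at
    chance (AI text is indistinguishable) and 1.0 means fully detectable.
    """
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


if __name__ == "__main__":
    human_reply = "lol no way that game was wild"
    model_reply = "That game was truly an exciting experience"
    print("similarity:", round(cosine_similarity(human_reply, model_reply), 3))
    # 1 = AI-written, 0 = human-written; predictions are made up for illustration.
    print("detectability:", detectability([1, 0, 1, 0], [1, 0, 0, 0]))
```

Under these assumptions, the trade-off reported in the abstract would appear as calibration strategies that push detectability toward 0.5 while lowering similarity to the human reference reply, or the reverse.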

Computational Turing Test Reveals Systematic Differences Between Human and AI Language

Computation and Language

AI writing is not as human-like as we thought.

6 Nov 2025

91%

How human is the machine? Evidence from 66,000 Conversations with Large Language Models

Human-Computer Interaction

AI sometimes thinks differently than people.

31 Aug 2025

90%

A Comprehensive Analysis of Large Language Model Outputs: Similarity, Diversity, and Bias

Computation and Language

Helps understand how AI writing is unique and fair.

14 May 2025

Country of Origin

🇨🇭 Switzerland

Page Count

22 pages
