Theory Trace Card: Theory-Driven Socio-Cognitive Evaluation of LLMs
By: Farzan Karimi-Malekabadi, Suhaib Abdurahman, Zhivar Sourati and more
Potential Business Impact:
Makes clear what AI evaluations truly measure.
Socio-cognitive benchmarks for large language models (LLMs) often fail to predict real-world behavior, even when models achieve high benchmark scores. Prior work has attributed this evaluation-deployment gap to problems of measurement and validity. While these critiques are insightful, we argue that they overlook a more fundamental issue: many socio-cognitive evaluations proceed without an explicit theoretical specification of the target capability, leaving the assumptions that link task performance to competence implicit. Without this theoretical grounding, benchmarks that exercise only narrow subsets of a capability are routinely misinterpreted as evidence of broad competence, a misreading that creates a systemic validity illusion by masking the failure to evaluate the capability's other essential dimensions. To address this gap, we make two contributions. First, we diagnose and formalize this theory gap as a foundational failure that undermines measurement and enables systematic overgeneralization of benchmark results. Second, we introduce the Theory Trace Card (TTC), a lightweight documentation artifact that accompanies a socio-cognitive evaluation and explicitly outlines its theoretical basis, the components of the target capability it exercises, its operationalization, and its limitations. We argue that TTCs enhance the interpretability and reuse of socio-cognitive evaluations by making the full validity chain explicit, from theory through task operationalization and scoring to limitations, without modifying benchmarks or requiring agreement on a single theory.
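The abstract frames the TTC as a documentation artifact and names four things it should record; it does not prescribe a machine-readable format. As a rough illustration only, the Python sketch below treats those four elements (theoretical basis, capability components exercised, operationalization, limitations) as fields of a simple record. All field names, types, and example values here are assumptions for illustration, not the authors' specification.

# Minimal sketch of what a machine-readable Theory Trace Card (TTC) might
# look like. Field names follow the four elements listed in the abstract;
# everything else (types, example values) is illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TheoryTraceCard:
    # Theoretical basis: the theory or framework grounding the evaluation.
    theoretical_basis: str
    # Components of the target capability that the tasks actually exercise.
    capability_components_exercised: List[str]
    # Operationalization: how the construct is turned into tasks and scores.
    operationalization: str
    # Limitations: components not covered, confounds, scope of valid claims.
    limitations: List[str] = field(default_factory=list)

# Hypothetical example for a theory-of-mind style benchmark.
example_ttc = TheoryTraceCard(
    theoretical_basis="False-belief reasoning as one component of theory of mind",
    capability_components_exercised=["first-order false-belief attribution"],
    operationalization="Multiple-choice questions scored by exact-match accuracy",
    limitations=[
        "Does not exercise second-order beliefs, desires, or pragmatic inference",
        "High scores should not be read as broad theory-of-mind competence",
    ],
)

print(example_ttc)

In practice, a record like this could accompany a benchmark's documentation or results release so that downstream users can see which components of the capability the reported scores do and do not speak to.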
Similar Papers
Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective
Human-Computer Interaction
Helps computers understand what people are thinking.
Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Artificial Intelligence
Teaches computers to think more like people.
ReEfBench: Quantifying the Reasoning Efficiency of LLMs
Artificial Intelligence
Tests whether AI truly reasons or just talks a lot.