Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension?
By: KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
Potential Business Impact:
Tests whether AI "proxy students" can stand in for real ones when piloting tutoring systems and exam questions.
Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, the extent to which these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then applied an Item Response Theory (IRT) model to position 11 diverse, state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align only incidentally. Applying grade-enforcement prompts changes model performance, but whether a model aligns with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair achieves consistent alignment across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on these findings.
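The abstract does not specify the exact IRT variant used, but the core idea of placing an LLM on the same ability scale as students can be illustrated with a minimal sketch. The snippet below, assuming a two-parameter logistic (2PL) model with made-up item parameters and responses, estimates a proxy student's ability theta by maximum likelihood; the paper's actual model and calibration details may differ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical item parameters for four NAEP-style items:
# discrimination a_i and difficulty b_i. In practice these would be
# calibrated on real student response data, placing items and
# test-takers on a shared scale.
a = np.array([1.2, 0.8, 1.5, 1.0])
b = np.array([-0.5, 0.3, 1.1, 0.0])

# Binary response pattern from an LLM proxy student
# (1 = correct, 0 = incorrect). Illustrative values only.
responses = np.array([1, 1, 0, 1])

def p_correct(theta, a, b):
    """2PL item response function: P(correct | ability theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def neg_log_likelihood(theta):
    """Negative log-likelihood of the response pattern at ability theta."""
    p = p_correct(theta, a, b)
    return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

# The maximum-likelihood estimate of theta positions the LLM on the
# same ability scale as the students used to calibrate a and b.
result = minimize_scalar(neg_log_likelihood, bounds=(-4, 4), method="bounded")
print(f"Estimated ability theta: {result.x:.2f}")
```

Comparing the estimated theta against the distribution of student abilities at a given grade is what lets the study say whether a model sits above, below, or near the average student.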
Similar Papers
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?
Computation and Language
Computers can't yet help make tests better.
Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study
Computation and Language
Computers can't teach as well as humans yet.
Delving Into the Psychology of Machines: Exploring the Structure of Self-Regulated Learning via LLM-Generated Survey Responses
Artificial Intelligence
Computers can pretend to be learning students.