Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis
By: Hongli Li, Che Han Chen, Kevin Fan, and more
Potential Business Impact:
Helps computers grade essays as well as people.
Despite the growing promise of large language models (LLMs) in automatic essay scoring (AES), empirical findings regarding their reliability compared to human raters remain mixed. Following the PRISMA 2020 guidelines, we synthesized 65 published and unpublished studies from January 2022 to August 2025 that examined agreement between LLMs and human raters in AES. Across studies, reported LLM-human agreement was generally moderate to good, with agreement indices (e.g., Quadratic Weighted Kappa, Pearson correlation, and Spearman's rho) mostly ranging between 0.30 and 0.80. Substantial variability in agreement levels was observed across studies, reflecting differences in study-specific factors as well as the lack of standardized reporting practices. Implications and directions for future research are discussed.
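The agreement indices named in the abstract (Quadratic Weighted Kappa, Pearson correlation, Spearman's rho) can be computed with standard Python libraries. The sketch below uses hypothetical human and LLM scores for illustration only; the ratings are invented, not data from the synthesized studies.

```python
# Illustrative computation of the three agreement indices named in the
# abstract, on hypothetical 1-5 essay scores (not data from any study).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr, spearmanr

human = [3, 4, 2, 5, 3, 4, 1, 5, 2, 3]  # hypothetical human ratings
llm = [3, 4, 3, 4, 3, 5, 2, 5, 2, 4]    # hypothetical LLM scores, same essays

# Quadratic Weighted Kappa penalizes disagreements by squared distance,
# so a score of 2 vs. 5 counts far more than 4 vs. 5.
qwk = cohen_kappa_score(human, llm, weights="quadratic")
r, _ = pearsonr(human, llm)      # linear association
rho, _ = spearmanr(human, llm)   # rank-order association

print(f"QWK: {qwk:.2f}, Pearson r: {r:.2f}, Spearman rho: {rho:.2f}")
```

Note that the three indices answer different questions: QWK rewards exact and near-exact matches on the score scale, while the correlations only measure whether the two raters order essays similarly, which is one reason studies reporting different indices are hard to compare directly.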
Similar Papers
Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education
Computers and Society
AI can't reliably grade essays yet.
Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
Computation and Language
Teaches computers to grade essays like humans.
Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment
Computation and Language
Helps computers grade essays with confidence.