
Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

Published: August 4, 2025 | arXiv ID: 2508.02442v1

By: Andrea Gaggioli, Giuseppe Casaburi, Leonardo Ercolani, and more

Potential Business Impact:

AI can't reliably grade essays yet.

This study investigates the reliability and validity of five advanced Large Language Models (LLMs) for automated essay scoring in a real-world higher education context: Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B. A total of 67 Italian-language student essays, written as part of a university psychology course, were evaluated using a four-criterion rubric (Pertinence, Coherence, Originality, Feasibility). Each model scored all essays across three prompt replications to assess intra-model stability. Human-LLM agreement, measured with Quadratic Weighted Kappa, was consistently low and non-significant, and within-model reliability across replications was similarly weak (median Kendall's W < 0.30). Systematic scoring divergences emerged, including a tendency to inflate Coherence and inconsistent handling of context-dependent dimensions. Inter-model agreement analysis revealed moderate convergence for Coherence and Originality, but negligible concordance for Pertinence and Feasibility. Although limited in scope, these findings suggest that current LLMs may struggle to replicate human judgment in tasks requiring disciplinary insight and contextual sensitivity. Human oversight remains critical when evaluating open-ended academic work, particularly in interpretive domains.
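The two agreement statistics named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' code: the function names (`quadratic_weighted_kappa`, `kendalls_w`, `average_ranks`) are hypothetical, the integer rating range passed to the kappa function is an assumption (the abstract does not state the rubric scale), and the tie-corrected form of Kendall's W is one common choice that the paper may or may not have used.

```python
def quadratic_weighted_kappa(a, b, min_rating, max_rating):
    """Agreement between two raters on an ordinal integer scale (assumed range)."""
    n = max_rating - min_rating + 1
    if n < 2:
        raise ValueError("need at least two rating categories")
    # Confusion matrix of rater a (rows) against rater b (columns).
    observed = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        observed[x - min_rating][y - min_rating] += 1
    total = len(a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / total  # count expected by chance
            num += weight * observed[i][j]
            den += weight * expected
    if den == 0:
        # Both raters used one and the same category; kappa is undefined here.
        return float("nan")
    return 1.0 - num / den


def average_ranks(scores):
    """Return (ranks with ties averaged, sum of t**3 - t over tie groups)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    tie_term = 0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def kendalls_w(ratings):
    """Tie-corrected Kendall's W for m raters (or replications) over n items."""
    m = len(ratings)
    n = len(ratings[0])
    rank_sums = [0.0] * n
    ties = 0
    for scores in ratings:
        ranks, tie_term = average_ranks(scores)
        ties += tie_term
        for idx, r in enumerate(ranks):
            rank_sums[idx] += r
    mean_sum = sum(rank_sums) / n
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    den = m ** 2 * (n ** 3 - n) - m * ties
    if den == 0:
        # Every rater tied all items; W is undefined here.
        return float("nan")
    return 12 * s / den
```

For example, `kendalls_w([run1, run2, run3])` would take one model's three replication score lists for a single criterion, and `quadratic_weighted_kappa(human, model, 1, 5)` would compare a human rater with a model under an assumed 1 to 5 scale. Both statistics equal 1 under perfect agreement; kappa near 0 indicates chance-level agreement, and W near 0 indicates no concordance among the rankings.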

A systematic comparison of Large Language Models for automated assignment assessment in programming education: Exploring the importance of architecture and vendor

Computers and Society

Computers grade student code, but not like teachers.

30 Sep 2025

92%

Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis

Computation and Language

Helps computers grade essays as well as people.

16 Dec 2025

92%

LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Computation and Language

Computer grades student work like a teacher.

13 Nov 2025


Page Count

24 pages
