
A systematic comparison of Large Language Models for automated assignment assessment in programming education: Exploring the importance of architecture and vendor

Published: September 30, 2025 | arXiv ID: 2509.26483v1

By: Marcin Jukiewicz

Potential Business Impact:

Computers grade student code, but not like teachers.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

This study presents the first large-scale, side-by-side comparison of contemporary Large Language Models (LLMs) in the automated grading of programming assignments. Drawing on over 6,000 student submissions collected across four years of an introductory programming course, we systematically analysed the distribution of grades, differences in mean scores and variability (reflecting stricter or more lenient grading), and the consistency and clustering of grading patterns across models. Eighteen publicly available models were evaluated: Anthropic (claude-3-5-haiku, claude-opus-4-1, claude-sonnet-4); DeepSeek (deepseek-chat, deepseek-reasoner); Google (gemini-2.0-flash-lite, gemini-2.0-flash, gemini-2.5-flash-lite, gemini-2.5-flash, gemini-2.5-pro); and OpenAI (gpt-4.1-mini, gpt-4.1-nano, gpt-4.1, gpt-4o-mini, gpt-4o, gpt-5-mini, gpt-5-nano, gpt-5). Statistical tests, correlation analyses and clustering analyses revealed clear, systematic differences between and within vendor families, with "mini" and "nano" variants consistently underperforming their full-scale counterparts. All models showed high agreement with the model consensus, as measured by the intraclass correlation coefficient, but only moderate agreement with human teachers' grades, indicating a persistent gap between automated and human assessment. These findings underscore that the choice of model for educational deployment is not neutral and should be guided by pedagogical goals, transparent reporting of evaluation metrics, and ongoing human oversight to ensure accuracy, fairness and relevance.
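
For readers unfamiliar with the intraclass correlation coefficient mentioned in the abstract, the following is a minimal sketch of one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater). The abstract does not state which ICC form the paper uses, so this choice is an assumption, and the function name `icc2_1` and the input layout (one row per submission, one column per grader) are hypothetical.

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows: one row per graded submission,
    one column per rater (e.g. a model and a human teacher).
    Assumes at least two rows, at least two columns, and no missing values.
    """
    n = len(scores)        # number of submissions
    k = len(scores[0])     # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares (no replication).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Because this variant measures absolute agreement, a grader that ranks submissions the same way as the others but is systematically stricter or more lenient lowers the coefficient. That is one reason the exact ICC form matters when comparing model grades with teachers' grades.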

Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education

Computers and Society

AI can't reliably grade essays yet.

4 Aug 2025 0

92%

Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning

Artificial Intelligence

Helps AI tutors give better, personalized learning help.

2 Sep 2025 1

91%

Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Computation and Language

Tests AI to find best answers for money news.

24 Jul 2025 0


Country of Origin

🇵🇱 Poland

Page Count

31 pages
