Analysis of instruction-based LLMs' capabilities to score and judge text-input problems in an academic setting
By: Valeria Ramirez-Garcia, David de-Fitero-Dominguez, Antonio Garcia-Cabot, and more
Potential Business Impact:
AI grades homework like a teacher.
Large language models (LLMs) can act as evaluators, a role studied through approaches such as LLM-as-a-Judge and fine-tuned judging LLMs. In education, LLMs have been explored as assistant tools for students and teachers. Our research investigates LLM-driven automatic evaluation systems for academic Text-Input Problems using rubrics. We propose five evaluation systems and test them on a custom dataset of 110 computer-science answers from higher-education students with three models: JudgeLM, Llama-3.1-8B and DeepSeek-R1-Distill-Llama-8B. The evaluation systems are: JudgeLM Evaluation, which uses the model's single-answer prompt to obtain a score; Reference Aided Evaluation, which uses a correct answer as a guide alongside the original context of the question; No Reference Evaluation, which omits the reference answer; Additive Evaluation, which scores against atomic criteria; and Adaptive Evaluation, which uses generated criteria fitted to each question. All evaluation methods have been compared with the results of a human evaluator. Results show that Reference Aided Evaluation is the best method to automatically evaluate and score Text-Input Problems with LLMs: it achieves the lowest median absolute deviation (0.945) and the lowest root mean square deviation (1.214) with respect to human evaluation, and it offers fair scoring as well as insightful and complete feedback. In contrast, Additive and Adaptive Evaluation fail to handle concise answers well, No Reference Evaluation lacks the information needed to assess questions correctly, and JudgeLM Evaluation does not provide good results due to the model's limitations. We conclude that Artificial Intelligence-driven automatic evaluation systems, supported by proper methodologies, show potential as complementary tools to other academic resources.
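As a rough illustration of how such a comparison against human grades might look, the sketch below builds a hypothetical reference-aided grading prompt and computes the two deviation metrics named above between LLM-assigned and human-assigned scores. The prompt template, function names, and toy scores are assumptions for illustration, not the authors' implementation, and the median absolute deviation is read here as the median of the absolute differences between the two graders.

# Illustrative sketch only: hypothetical prompt template and deviation metrics.
import math
import statistics

def build_reference_aided_prompt(question, reference_answer, student_answer):
    # Assumed template: a correct reference answer guides the judge alongside
    # the original question context (cf. Reference Aided Evaluation above).
    return (
        "Grade the student's answer on a 0-10 scale.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Student answer: {student_answer}\n"
        "Return only the numeric score."
    )

def median_absolute_deviation(llm_scores, human_scores):
    # Median of the absolute score differences across all graded answers.
    return statistics.median(abs(l - h) for l, h in zip(llm_scores, human_scores))

def root_mean_square_deviation(llm_scores, human_scores):
    # RMSD penalizes large disagreements with the human grader more heavily.
    return math.sqrt(
        sum((l - h) ** 2 for l, h in zip(llm_scores, human_scores)) / len(llm_scores)
    )

# Toy scores (made up); the paper reports MAD 0.945 and RMSD 1.214 for
# Reference Aided Evaluation on its 110-answer dataset.
llm_scores = [7.0, 8.5, 6.0, 9.0]
human_scores = [7.5, 8.0, 5.0, 9.0]
print(median_absolute_deviation(llm_scores, human_scores))
print(root_mean_square_deviation(llm_scores, human_scores))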
Similar Papers
From Code to Courtroom: LLMs as the New Software Judges
Software Engineering
Lets computers check the quality of other computers' code.
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Computation and Language
Helps computers judge writing better than people.
Can LLMs Replace Human Evaluators? An Empirical Study of LLM-as-a-Judge in Software Engineering
Software Engineering
Helps computers judge code quality like people.