Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment
By: Ahmed Karim, Qiao Wang, Zheng Yuan
Potential Business Impact:
Helps computers grade essays with confidence.
Automated Essay Scoring (AES) systems now reach near-human agreement on some public benchmarks, yet real-world adoption, especially in high-stakes examinations, remains limited. A principal obstacle is that most models output a single score without any accompanying measure of confidence or explanation. We address this gap with conformal prediction, a distribution-free wrapper that equips any classifier with set-valued outputs and formal coverage guarantees. Two open-source large language models (Llama-3 8B and Qwen-2.5 3B) are fine-tuned on three diverse corpora (ASAP, TOEFL11, Cambridge-FCE) and calibrated to a 90 percent coverage target. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. To our knowledge, this is the first work to combine conformal prediction and UAcc for essay scoring. The calibrated models consistently meet the coverage target while keeping prediction sets compact, indicating that open-source, mid-sized LLMs can already support teacher-in-the-loop AES; we discuss scaling and broader user studies as future work.
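For readers unfamiliar with the wrapper, the sketch below shows how split conformal prediction turns a scorer's per-label probabilities into set-valued predictions with a finite-sample coverage guarantee. It is a minimal illustration, not the paper's implementation: the choice of nonconformity score (1 minus the probability of the true label), the toy data, and the UAcc-like summary at the end are assumptions made here for clarity.

```python
# Minimal sketch of split conformal prediction for discrete essay scores.
# Assumptions (not from the paper): the fine-tuned scorer exposes softmax
# probabilities over score labels; the nonconformity score is 1 - p(true
# label); the "UAcc-like" number at the end is only an illustrative
# accuracy-vs-set-size trade-off, not the paper's exact UAcc definition.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.10):
    """Return the conformal threshold q_hat at miscoverage level alpha."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true score label.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_set(probs, q_hat):
    """All score labels whose nonconformity score falls below the threshold."""
    return np.where(1.0 - probs <= q_hat)[0]

# --- toy usage with synthetic probabilities ---
rng = np.random.default_rng(0)
num_labels = 6                      # e.g. a 0-5 score scale
cal_probs = rng.dirichlet(np.ones(num_labels), size=500)
cal_labels = rng.integers(0, num_labels, size=500)
q_hat = calibrate_threshold(cal_probs, cal_labels, alpha=0.10)

test_probs = rng.dirichlet(np.ones(num_labels), size=100)
test_labels = rng.integers(0, num_labels, size=100)
sets = [prediction_set(p, q_hat) for p in test_probs]

coverage = np.mean([y in s for y, s in zip(test_labels, sets)])
avg_size = np.mean([len(s) for s in sets])
# Illustrative uncertainty-aware accuracy: coverage discounted by average
# set size, rewarding models that are both correct and concise.
uacc_like = coverage / avg_size
print(f"coverage={coverage:.2f}, avg set size={avg_size:.2f}, UAcc-like={uacc_like:.2f}")
```

With real model outputs, a 10 percent miscoverage level would yield prediction sets that contain the human score about 90 percent of the time; smaller average sets then indicate a more decisive, better-calibrated scorer.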
Similar Papers
Agreement Between Large Language Models and Human Raters in Essay Scoring: A Research Synthesis
Computation and Language
Helps computers grade essays as well as people.
LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models
Computation and Language
Helps computers grade essays more like humans.
Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
Computation and Language
Teaches computers to grade essays like humans.