Beyond Static Scoring: Enhancing Assessment Validity via AI-Generated Interactive Verification

Published: December 14, 2025 | arXiv ID: 2512.12592v1

By: Tom Lee, Sihoon Lee, Seonghun Kim

Large Language Models (LLMs) challenge the validity of traditional open-ended assessments by blurring the lines of authorship. While recent research has focused on the accuracy of automated essay scoring (AES), these static approaches fail to capture process evidence or verify genuine student understanding. This paper introduces a novel Human-AI Collaboration framework that enhances assessment integrity by combining rubric-based automated scoring with AI-generated, targeted follow-up questions. In a pilot study with university instructors (N=9), we demonstrate that while Stage 1 (Auto-Scoring) ensures procedural fairness and consistency, Stage 2 (Interactive Verification) is essential for construct validity, effectively diagnosing superficial reasoning or unverified AI use. We report on the system's design, instructor perceptions of fairness versus validity, and the necessity of adaptive difficulty in follow-up questioning. The findings offer a scalable pathway toward authentic assessment that moves beyond policing AI to integrating it as a synergistic partner in the evaluation process.
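The two-stage flow described in the abstract can be sketched in outline form. This is a hypothetical illustration only, not the authors' implementation: the function names, rubric format, and keyword-matching scorer are placeholder assumptions standing in for the paper's rubric-based LLM scoring and AI-generated follow-up questions.

```python
# Hypothetical sketch of the two-stage framework: Stage 1 scores an answer
# against a rubric; Stage 2 generates targeted verification questions.
# All names and logic here are illustrative assumptions, not the paper's system.

def auto_score(answer: str, rubric: dict[str, str]) -> dict[str, bool]:
    """Stage 1 (Auto-Scoring): rubric-based scoring. Naive keyword matching
    stands in for an LLM-based scorer here."""
    return {criterion: keyword.lower() in answer.lower()
            for criterion, keyword in rubric.items()}

def follow_up_questions(scores: dict[str, bool]) -> list[str]:
    """Stage 2 (Interactive Verification): ask targeted follow-ups on criteria
    the answer satisfied, probing whether the understanding is genuine."""
    return [f"Explain in your own words how your answer addresses: {criterion}"
            for criterion, passed in scores.items() if passed]

rubric = {"defines recursion": "recursion", "gives a base case": "base case"}
answer = "A function using recursion calls itself until a base case stops it."

scores = auto_score(answer, rubric)
questions = follow_up_questions(scores)
print(scores)     # which rubric criteria the answer satisfied
print(questions)  # verification prompts for the satisfied criteria
```

In a real deployment, both stages would call an LLM (scoring against the rubric, then generating adaptive-difficulty follow-ups), and an instructor would review the student's responses to the Stage 2 questions.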

Category
Computer Science:
Computers and Society