Score: 2

Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring

Published: January 9, 2026 | arXiv ID: 2601.05545v1

By: Hongjin Kim, Jeonghyun Kang, Harksoo Kim

Potential Business Impact:

Teaches AI scoring systems to recognize harmful content (e.g., racist or gender-biased arguments) in essays and penalize it appropriately.

Business Areas:
Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

This study addresses a critical gap in Automated Essay Scoring (AES) systems and Large Language Models (LLMs): their ability to identify and appropriately score harmful essays. Despite advances in AES technology, current models often overlook ethically and morally problematic content, erroneously assigning high scores to essays that propagate harmful opinions. In this study, we introduce the Harmful Essay Detection (HED) benchmark, which includes essays on sensitive topics such as racism and gender bias, to test how well various LLMs recognize and score harmful content. Our findings reveal that (1) LLMs require further enhancement to reliably distinguish harmful essays from merely argumentative ones, and (2) both current AES models and LLMs fail to consider the ethical dimensions of content during scoring. The study underscores the need for more robust AES systems that are sensitive to the ethical implications of the content they score.
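The abstract does not describe the authors' evaluation protocol, so as a rough illustration only, the sketch below shows one way an LLM could be prompted to flag an essay as harmful versus merely argumentative before a score is assigned. The prompt wording, label set, model name, and use of the OpenAI chat API are assumptions for this sketch, not details from the paper.

```python
# Hypothetical sketch: asking an LLM to separate harmful from argumentative
# essays before scoring. Prompt design and labels are illustrative only;
# this is NOT the protocol or benchmark setup used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CLASSIFY_PROMPT = (
    "You are an essay rater. First decide whether the essay merely argues a "
    "controversial position (ARGUMENTATIVE) or promotes harm such as racism "
    "or gender bias (HARMFUL). Reply with exactly one word: HARMFUL or "
    "ARGUMENTATIVE.\n\nEssay:\n{essay}"
)

def classify_essay(essay: str, model: str = "gpt-4o-mini") -> str:
    """Return the model's HARMFUL/ARGUMENTATIVE judgment for one essay."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CLASSIFY_PROMPT.format(essay=essay)}],
        temperature=0,  # keep outputs stable for evaluation runs
    )
    return response.choices[0].message.content.strip().upper()

if __name__ == "__main__":
    sample = "Everyone from group X is inferior and should be excluded."
    label = classify_essay(sample)
    # An ethically aware scorer would assign a low score when label == "HARMFUL",
    # rather than rewarding the essay's fluency or structure.
    print(label)
```

Under this kind of two-stage setup, the paper's reported failure mode would correspond to the model returning ARGUMENTATIVE (or a high score) for essays that a human rater would judge harmful.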

Country of Origin
πŸ‡°πŸ‡· Korea, Republic of


Page Count
27 pages

Category
Computer Science:
Computation and Language