Toward Trustworthy Difficulty Assessments: Large Language Models as Judges in Programming and Synthetic Tasks

Published: November 23, 2025 | arXiv ID: 2511.18597v1

By: H. M. Shadman Tabib, Jaber Ahmed Deedar

Potential Business Impact:

AI struggles to guess how hard computer problems are.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language and code generation, and are increasingly deployed as automatic judges of model outputs and learning activities. Yet their behavior on structured tasks, such as predicting the difficulty of competitive programming problems, remains under-explored. We conduct a systematic comparison of GPT-4o, used purely as a natural-language difficulty assessor, against an interpretable LightGBM ensemble trained on explicit numeric and textual features. On a dataset of 1,825 LeetCode problems labeled Easy, Medium, or Hard, LightGBM attains 86% accuracy, whereas GPT-4o reaches only 37.75%. Detailed analyses, including confusion matrices and SHAP-based interpretability, show that numeric constraints (such as input size limits and acceptance rates) play a crucial role in separating Hard problems from easier ones. By contrast, GPT-4o often overlooks these cues and exhibits a strong bias toward simpler categories. We further probe GPT-4o through a synthetic Hard-problem generation protocol. Surprisingly, GPT-4o labels almost all of its own synthetic Hard problems as Medium, contradicting its tendency to downgrade real Hard problems to Easy. Our findings connect to recent work on LLMs-as-judges and automatic difficulty estimation in programming and education, and highlight concrete failure modes that must be addressed before LLM-based judges can be considered trustworthy in competitive programming, educational platforms, or reinforcement-learning pipelines.
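
The abstract compares the two judges through accuracy and confusion matrices over the labels Easy, Medium, and Hard. As a rough illustration of how such an evaluation can be computed, here is a minimal sketch in plain Python. The function names and the toy labels are assumptions made for this example; they are not the authors' code or data.

```python
from collections import Counter

# Label order used for the rows and columns of the confusion matrix.
LABELS = ["Easy", "Medium", "Hard"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Return a matrix where rows are true labels and columns are predictions."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label exactly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels invented for illustration only; NOT data from the paper.
toy_true = ["Easy", "Medium", "Hard", "Hard", "Medium", "Easy"]
toy_pred = ["Easy", "Easy", "Medium", "Easy", "Medium", "Easy"]

matrix = confusion_matrix(toy_true, toy_pred)
for label, row in zip(LABELS, matrix):
    print(f"{label:<6} {row}")
print("accuracy:", accuracy(toy_true, toy_pred))
```

In this toy example the Hard row has no entry in the Hard column, which is the kind of downward bias (Hard predicted as Medium or Easy) that a confusion matrix makes visible and that a single accuracy figure hides.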

Enhancing Large Language Models for Automated Homework Assessment in Undergraduate Circuit Analysis

Computers and Society

Helps AI grade student homework much better.

22 Nov 2025

Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment

Physics Education

AI solves physics problems better than students.

14 May 2025

Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving

Software Engineering

AI helps teachers grade student code better.

18 Mar 2025

Page Count

9 pages
