Assessing GPT Performance in a Proof-Based University-Level Course Under Blind Grading

Published: May 19, 2025 | arXiv ID: 2505.13664v1

By: Ming Ding, Rasmus Kyng, Federico Solda and more

Potential Business Impact:

AI can now pass a proof-based algorithms exam, so grading needs to account for it.

Business Areas:

Natural Language Processing Artificial Intelligence, Data and Analytics, Software

As large language models (LLMs) advance, their role in higher education, particularly in free-response problem-solving, requires careful examination. This study assesses the performance of GPT-4o and o1-preview under realistic educational conditions in an undergraduate algorithms course. Anonymous GPT-generated solutions to take-home exams were graded by teaching assistants unaware of their origin. Our analysis examines both coarse-grained performance (scores) and fine-grained reasoning quality (error patterns). Results show that GPT-4o consistently struggles, failing to reach the passing threshold, while o1-preview performs significantly better, surpassing the passing score and even exceeding the student median in certain exercises. However, both models exhibit issues with unjustified claims and misleading arguments. These findings highlight the need for robust assessment strategies and AI-aware grading policies in education.

Evaluating GPT- and Reasoning-based Large Language Models on Physics Olympiad Problems: Surpassing Human Performance and Implications for Educational Assessment

Physics Education

AI solves physics problems better than students.

14 May 2025 0

90%

Enhancing Large Language Models for Automated Homework Assessment in Undergraduate Circuit Analysis

Computers and Society

Helps AI grade student homework much better.

22 Nov 2025 0

90%

LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Computation and Language

Computer grades student work like a teacher.

13 Nov 2025 1

Country of Origin

🇨🇭 Switzerland

Page Count

66 pages
