Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
By: Priscylla Silva, Evandro Costa
Potential Business Impact:
AI can generate feedback on student code, but 37% of its hints contain mistakes.
Effective feedback is important for students learning to solve programming problems. Large Language Models (LLMs) have emerged as potential tools for automating feedback generation. However, their reliability and their ability to identify reasoning errors in student code remain poorly understood. This study evaluates the performance of four LLMs (GPT-4o, GPT-4o mini, GPT-4-Turbo, and Gemini-1.5-pro) on a benchmark dataset of 45 student solutions. We assessed the models' capacity to provide accurate and insightful feedback, particularly in identifying reasoning mistakes. Our analysis reveals that 63% of feedback hints were accurate and complete, while 37% contained mistakes, including incorrect line identification, flawed explanations, or hallucinated issues. These findings highlight both the potential and the limitations of LLMs in programming education and underscore the need for improvements to enhance reliability and minimize risks in educational applications.
Similar Papers
Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
Artificial Intelligence
Helps AI tutors give better, personalized learning help.
Generating Planning Feedback for Open-Ended Programming Exercises with LLMs
Computation and Language
Helps teachers grade code, even with mistakes.
Beyond Final Answers: Evaluating Large Language Models for Math Tutoring
Human-Computer Interaction
Helps computers teach math, but they make mistakes.