Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
By: Bo Yuan, Jiazi Hu
Potential Business Impact:
Helps AI tutors give better, more personalized learning support.
While Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations in authentic learning scenarios remain limited. This study presents an empirical comparison of three state-of-the-art LLMs on a tutoring task that simulates a realistic learning setting. Using a dataset comprising a student's answers to ten mixed-format questions with correctness labels, each LLM is required to (i) analyze the quiz to identify the underlying knowledge components, (ii) infer the student's mastery profile, and (iii) generate targeted guidance for improvement. To mitigate subjectivity and evaluator bias, we employ Gemini as a virtual judge that performs pairwise comparisons along four dimensions: accuracy, clarity, actionability, and appropriateness. Results analyzed with the Bradley-Terry model indicate that GPT-4o is generally preferred, producing feedback that is more informative and better structured than that of its counterparts, while DeepSeek-V3 and GLM-4.5 show intermittent strengths but lower consistency. These findings support the feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological guidance for future empirical research on LLM-driven personalized learning.
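To make the aggregation step concrete: the abstract's protocol turns a judge's pairwise preferences into a global ranking with the Bradley-Terry model. Below is a minimal Python sketch of the standard MM fitting update (Hunter, 2004); the win counts are hypothetical placeholders, not the paper's data, and this is an illustration of the technique rather than the authors' actual code.

    # Minimal Bradley-Terry fit via the MM update (Hunter, 2004).
    # The win counts below are hypothetical placeholders, not the
    # paper's reported results; only numpy is assumed.
    import numpy as np

    def bradley_terry(wins, n_iters=200, tol=1e-8):
        # wins[i, j]: number of comparisons in which model i was
        # preferred over model j by the judge.
        n = wins.shape[0]
        p = np.ones(n)                  # initial strengths
        total = wins + wins.T           # comparisons between each pair
        for _ in range(n_iters):
            p_new = np.empty(n)
            for i in range(n):
                denom = sum(total[i, j] / (p[i] + p[j])
                            for j in range(n)
                            if j != i and total[i, j] > 0)
                p_new[i] = wins[i].sum() / denom
            p_new /= p_new.sum()        # normalize for identifiability
            if np.max(np.abs(p_new - p)) < tol:
                return p_new
            p = p_new
        return p

    # Hypothetical judge outcomes for (GPT-4o, DeepSeek-V3, GLM-4.5).
    wins = np.array([[0., 7., 8.],
                     [3., 0., 5.],
                     [2., 5., 0.]])
    print(bradley_terry(wins))          # fitted strengths, summing to 1

Under this model, the probability that model i is preferred over model j is p_i / (p_i + p_j), so the fitted strengths induce a global ranking from purely pairwise judgments.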
Similar Papers
Evaluation of LLMs for mathematical problem solving
Artificial Intelligence
Computers solve harder math problems better.
A systematic comparison of Large Language Models for automated assignment assessment in programming education: Exploring the importance of architecture and vendor
Computers and Society
Computers grade student code, but not like teachers.
Can Large Language Models Match Tutoring System Adaptivity? A Benchmarking Study
Computation and Language
Computers can't teach as well as humans yet.