Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
By: Bo Yuan, Jiazi Hu
Potential Business Impact:
Helps AI tutors deliver more accurate, personalized guidance to individual learners.
While Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations within authentic learning scenarios remain limited. This study conducts an empirical comparison of three state-of-the-art LLMs (GPT-4o, DeepSeek-V3, and GLM-4.5) on a tutoring task that simulates a realistic learning setting. Using a dataset comprising a student's answers to ten mixed-format questions with correctness labels, each LLM is required to (i) analyze the quiz to identify the underlying knowledge components, (ii) infer the student's mastery profile, and (iii) generate targeted guidance for improvement. To mitigate subjectivity and evaluator bias, we employ Gemini as a virtual judge to perform pairwise comparisons along four dimensions: accuracy, clarity, actionability, and appropriateness. Results analyzed with the Bradley-Terry model indicate that GPT-4o is generally preferred, producing feedback that is more informative and better structured than that of its counterparts, while DeepSeek-V3 and GLM-4.5 show intermittent strengths but lower consistency. These findings highlight the feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological guidance for future empirical research on LLM-driven personalized learning.
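To make the analysis step concrete, the sketch below shows how pairwise judge preferences can be fitted with a Bradley-Terry model, under which the probability that model i is preferred over model j is exp(beta_i) / (exp(beta_i) + exp(beta_j)). This is a minimal illustration in Python using SciPy, not the authors' implementation; the win counts are hypothetical placeholders, not the paper's data.

```python
# Minimal Bradley-Terry fit on pairwise judge preferences.
# NOTE: the win counts below are hypothetical placeholders for illustration;
# they are NOT the comparison counts reported in the paper.
import numpy as np
from scipy.optimize import minimize

models = ["GPT-4o", "DeepSeek-V3", "GLM-4.5"]

# wins[i, j] = number of comparisons in which model i was preferred over model j.
wins = np.array([
    [0, 14, 16],
    [6,  0, 11],
    [4,  9,  0],
], dtype=float)

def neg_log_likelihood(beta):
    """Negative log-likelihood of the Bradley-Terry model:
    P(i preferred over j) = exp(beta_i) / (exp(beta_i) + exp(beta_j))."""
    nll = 0.0
    for i in range(len(beta)):
        for j in range(len(beta)):
            if i != j:
                # log P(i > j) = beta_i - log(exp(beta_i) + exp(beta_j))
                nll -= wins[i, j] * (beta[i] - np.logaddexp(beta[i], beta[j]))
    return nll

# The likelihood is invariant to shifting all betas by a constant,
# so center the fitted strengths at zero after optimization.
result = minimize(neg_log_likelihood, x0=np.zeros(len(models)), method="BFGS")
beta = result.x - result.x.mean()

for name, strength in sorted(zip(models, beta), key=lambda t: -t[1]):
    print(f"{name}: strength = {strength:+.3f}")
```

With placeholder counts like these, GPT-4o would receive the largest strength estimate, mirroring the qualitative ranking described in the abstract; the actual estimates depend on the paper's real comparison counts.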
Similar Papers
Large Language Models for Education and Research: An Empirical and User Survey-based Analysis
Artificial Intelligence
Helps students and researchers learn and solve problems.
Assessing Large Language Models for Automated Feedback Generation in Learning Programming Problem Solving
Software Engineering
AI helps teachers grade student code better.
Evaluation of LLMs for mathematical problem solving
Artificial Intelligence
Computers solve harder math problems better.