Efficient RL for optimizing conversation level outcomes with an LLM-based tutor
By: Hyunji Nam, Omer Gottesman, Amy Zhang, and more
Potential Business Impact:
Helps an AI tutor teach math better by tracking what the student has done so far in the conversation.
Large language models (LLMs) built on existing reinforcement learning from human feedback (RLHF) frameworks typically optimize responses based on immediate, turn-level human preferences. However, this approach falls short in multi-turn dialogue settings, such as online math tutoring. We propose a method to enhance LLM-based tutors by representing the dialogue history with a lower-dimensional latent state representation of the student and optimizing a long-term policy that selects high-level actions based on this latent state. The goal is to better align the tutor's behavior with the long-term objective of guiding the student toward solving a target math problem on their own. Our model is lightweight, requiring fewer computational resources than prior work that trains the tutor policy end-to-end to directly output the tutor's next utterance. Our experimental results demonstrate that these modifications lead to improved long-term outcomes compared to prompting-based baselines in LLM-simulated tutoring tasks.
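As a rough illustration of the pipeline the abstract describes, the sketch below shows how a dialogue history might be compressed into a low-dimensional latent student state, how a lightweight policy could pick a high-level tutoring action from that state, and how the action could condition a frozen LLM's next utterance. This is not the authors' implementation: the names (LatentStudentState, encode_dialogue, tutor_policy), the action set, and the heuristic stand-ins for the learned encoder and RL-trained policy are all illustrative assumptions.

```python
# Minimal sketch, assuming a latent-state encoder and a high-level action
# policy as described in the abstract. The encoder and policy here are
# hand-written heuristics standing in for learned components.

from dataclasses import dataclass

# Illustrative discrete action space for the high-level policy (assumption).
HIGH_LEVEL_ACTIONS = ["give_hint", "ask_guiding_question", "check_understanding"]

@dataclass
class LatentStudentState:
    progress: float    # estimated fraction of the problem solved (0-1)
    confusion: float   # estimated confusion from recent turns (0-1)
    turns_elapsed: int # number of dialogue turns so far

def encode_dialogue(history: list) -> LatentStudentState:
    """Toy stand-in for the learned encoder: compress the dialogue history
    into a low-dimensional latent state of the student."""
    progress = min(1.0, sum("correct" in turn.lower() for turn in history) / 3)
    confusion = min(1.0, sum("don't understand" in turn.lower() for turn in history) / 2)
    return LatentStudentState(progress=progress, confusion=confusion,
                              turns_elapsed=len(history))

def tutor_policy(state: LatentStudentState) -> str:
    """Toy stand-in for the long-horizon policy: choose a high-level action
    from the latent state instead of generating the utterance directly."""
    if state.confusion > 0.5:
        return "give_hint"
    if state.progress < 0.5:
        return "ask_guiding_question"
    return "check_understanding"

def build_llm_prompt(action: str, history: list, problem: str) -> str:
    """Realize the chosen high-level action as an instruction to a frozen
    LLM tutor, which would produce the actual next utterance."""
    return (
        f"You are a math tutor helping with: {problem}\n"
        "Dialogue so far:\n" + "\n".join(history) + "\n"
        f"High-level instruction: {action.replace('_', ' ')}. "
        "Do not reveal the final answer."
    )

if __name__ == "__main__":
    history = [
        "Student: I tried factoring but I don't understand where to start.",
        "Tutor: What do the two numbers need to multiply to?",
    ]
    state = encode_dialogue(history)
    action = tutor_policy(state)
    print(state, action)
    print(build_llm_prompt(action, history, "Solve x^2 + 5x + 6 = 0"))
```

In this sketch, only the small encoder and policy would need training, while the LLM that writes the actual tutoring utterance stays fixed, which is one way to read the paper's claim that the approach is lighter-weight than training the tutor policy end-to-end.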
Similar Papers
Training LLM-based Tutors to Improve Student Learning Outcomes in Dialogues
Computation and Language
Teaches students better by guessing right answers.
Building Knowledge from Interactions: An LLM-Based Architecture for Adaptive Tutoring and Social Reasoning
Robotics
Robots learn to teach and remember like humans.
Beyond Final Answers: Evaluating Large Language Models for Math Tutoring
Human-Computer Interaction
Helps computers teach math, but they make mistakes.