RL from Teacher-Model Refinement: Gradual Imitation Learning for Machine Translation
By: Dongyub Jude Lee, Zhenyi Ye, Pengcheng He
Potential Business Impact:
Helps computers translate languages more accurately by learning from a stronger teacher model's corrections.
Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have achieved impressive gains but depend heavily on large, carefully curated triplet datasets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), a novel framework that removes the reliance on static triplets by leveraging continuous, high-quality feedback from an external teacher model (GPT-4o). RLfR frames each translation step as a micro-tutorial: the actor generates a hypothesis, the teacher refines it, and the actor is rewarded based on how closely it aligns with the teacher's refinement. Guided by two complementary signals, (i) negative edit distance, which promotes lexical and structural fidelity, and (ii) COMET score, which ensures semantic adequacy, the actor progressively learns to emulate the teacher, mirroring a human learning process of incremental, iterative improvement. On the FLORES-200 benchmark (English to and from German, Spanish, Chinese, Korean, and Japanese), RLfR consistently outperforms both MT-SFT and preference-based baselines, significantly improving COMET (semantic adequacy) and M-ETA (entity preservation) scores.
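
As a rough illustration of the reward described in the abstract, the Python sketch below combines the two signals into a single scalar. It is not the authors' implementation: the alpha/beta weighting, whitespace tokenization, and length normalization are illustrative assumptions, and the COMET value is assumed to be supplied precomputed by an external scorer (e.g. a COMET model).

    # Minimal sketch of the RLfR reward described above (not the authors' implementation).
    # Assumptions: alpha/beta weights, whitespace tokenization, and length normalization
    # are illustrative choices; the COMET score is passed in from an external scorer.

    def edit_distance(hyp: str, ref: str) -> int:
        """Word-level Levenshtein distance between a hypothesis and a reference."""
        h, r = hyp.split(), ref.split()
        prev = list(range(len(r) + 1))
        for i, hw in enumerate(h, start=1):
            curr = [i]
            for j, rw in enumerate(r, start=1):
                curr.append(min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (hw != rw),  # substitution (free if tokens match)
                ))
            prev = curr
        return prev[-1]

    def rlfr_reward(hypothesis: str, teacher_refinement: str, comet: float,
                    alpha: float = 0.5, beta: float = 0.5) -> float:
        """Combine the two feedback signals into one scalar reward.

        (i)  negative edit distance to the teacher's refinement -> lexical/structural fidelity
        (ii) COMET score of the hypothesis                      -> semantic adequacy
        """
        # Normalize by refinement length so long sentences are not penalized more than short ones.
        fidelity = -edit_distance(hypothesis, teacher_refinement) / max(len(teacher_refinement.split()), 1)
        return alpha * fidelity + beta * comet

    if __name__ == "__main__":
        hyp = "The cat sat on the mat"
        refinement = "The cat sat quietly on the mat"
        # comet=0.85 stands in for a score returned by an external COMET model.
        print(rlfr_reward(hyp, refinement, comet=0.85))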