ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
By: Shaomu Tan, Christof Monz
Potential Business Impact:
Checks how good translations are, and stays reliable even on very bad ones.
A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at the segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing directly on imperfect human ratings, ReMedy learns relative translation quality from pairwise preference data, yielding more reliable evaluation. In extensive experiments across the WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both the segment and system level. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs, including MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses show that ReMedy is particularly strong at detecting translation errors and evaluating low-quality translations.
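The core idea the abstract describes, learning relative quality from pairwise preferences instead of regressing on raw scores, is commonly trained with a Bradley-Terry-style objective. Below is a minimal sketch of that pairwise loss; the function name, variable names, and the notion of a scalar-output reward model are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_better: torch.Tensor,
                         score_worse: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss (an illustrative sketch, not
    ReMedy's exact code): push the reward model to score the
    human-preferred translation above the dispreferred one.

    Both inputs are scalar rewards, one per (source, translation) pair.
    """
    return -F.logsigmoid(score_better - score_worse).mean()

# Illustrative usage: `reward_model` stands in for any encoder that maps
# a (source, translation) pair to a scalar quality score.
#   scores_a = reward_model(src, hyp_a)   # preferred by annotators
#   scores_b = reward_model(src, hyp_b)   # dispreferred
#   loss = pairwise_reward_loss(scores_a, scores_b)
#   loss.backward()

if __name__ == "__main__":
    better = torch.tensor([2.1, 0.3])    # rewards for preferred translations
    worse = torch.tensor([1.0, -0.5])    # rewards for dispreferred ones
    print(pairwise_reward_loss(better, worse))  # small positive loss
```

Because the loss depends only on score differences, it tolerates annotators who disagree on absolute scales but agree on which translation is better, which is the noise-robustness argument the abstract makes.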
Similar Papers
ReMedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations
Computation and Language
Judges translation quality and explains its reasoning, without needing error labels.
Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
Machine Learning (CS)
Cleans AI's learning data for better results.
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
Artificial Intelligence
Teaches AI to follow instructions better.