ReMedy: Learning Machine Translation Evaluation from Human Preferences with Reward Modeling
By: Shaomu Tan, Christof Monz
Potential Business Impact:
Checks how good translations are, and stays reliable even on very bad ones.
A key challenge in MT evaluation is the inherent noise and inconsistency of human ratings. Regression-based neural metrics struggle with this noise, while prompting LLMs shows promise at system-level evaluation but performs poorly at the segment level. In this work, we propose ReMedy, a novel MT metric framework that reformulates translation evaluation as a reward modeling task. Instead of regressing directly on imperfect human ratings, ReMedy learns relative translation quality from pairwise preference data, yielding more reliable evaluation. In extensive experiments across the WMT22-24 shared tasks (39 language pairs, 111 MT systems), ReMedy achieves state-of-the-art performance at both the segment and system level. Specifically, ReMedy-9B surpasses larger WMT winners and massive closed LLMs, including MetricX-13B, XCOMET-Ensemble, GEMBA-GPT-4, PaLM-540B, and finetuned PaLM2. Further analyses show that ReMedy is particularly strong at detecting translation errors and evaluating low-quality translations.
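The core idea the abstract describes, learning relative quality from pairwise preferences instead of regressing on raw scores, is commonly trained with a Bradley-Terry-style objective. Below is a minimal sketch of that pairwise loss; the function name, variable names, and the notion of a scalar-output reward model are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(score_better: torch.Tensor,
                         score_worse: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style pairwise loss (an illustrative sketch, not
    ReMedy's exact code): push the reward model to score the
    human-preferred translation above the dispreferred one.

    Both inputs are scalar rewards, one per (source, translation) pair.
    """
    return -F.logsigmoid(score_better - score_worse).mean()

# Illustrative usage: `reward_model` stands in for any encoder that maps
# a (source, translation) pair to a scalar quality score.
#   scores_a = reward_model(src, hyp_a)   # preferred by annotators
#   scores_b = reward_model(src, hyp_b)   # dispreferred
#   loss = pairwise_reward_loss(scores_a, scores_b)
#   loss.backward()

if __name__ == "__main__":
    better = torch.tensor([2.1, 0.3])    # rewards for preferred translations
    worse = torch.tensor([1.0, -0.5])    # rewards for dispreferred ones
    print(pairwise_reward_loss(better, worse))  # small positive loss
```

Because the loss depends only on score differences, it tolerates annotators who disagree on absolute scales but agree on which translation is better, which is the noise-robustness argument the abstract makes.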
Similar Papers
ReMedy-R: Generative Reasoning for Machine Translation Evaluation without Error Annotations
Computation and Language
Judges translation quality and explains its reasoning, without needing error labels.
Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
Machine Learning (CS)
Cleans AI's learning data for better results.
Beyond Monolithic Rewards: A Hybrid and Multi-Aspect Reward Optimization for MLLM Alignment
Artificial Intelligence
Teaches AI to follow instructions better.