Long-context Reference-based MT Quality Estimation
By: Sami Ul Haq, Chinonso Cynthia Osuji, Sheila Castilho, and others
Potential Business Impact:
Helps computers judge the quality of translations, even for long documents.
In this paper, we present our submission to the Tenth Conference on Machine Translation (WMT25) Shared Task on Automated Translation Quality Evaluation. Our systems are built upon the COMET framework and trained to predict segment-level Error Span Annotation (ESA) scores using augmented long-context data. To construct long-context training data, we concatenate in-domain, human-annotated sentences and compute a weighted average of their scores. We integrate multiple human judgment datasets (MQM, SQM, and DA) by normalising their scales and train multilingual regression models to predict quality scores from the source, hypothesis, and reference translations. Experimental results show that incorporating long-context information improves correlations with human judgments compared to models trained only on short segments.
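The abstract describes two concrete steps: normalising MQM, SQM, and DA scores onto a shared scale, and building long-context training examples by concatenating in-domain segments and combining their scores as a weighted average. Below is a minimal sketch of how such a pipeline might look. The field names (`src`, `mt`, `ref`, `score`), the group size, and the use of hypothesis length as the averaging weight are illustrative assumptions; the abstract does not specify the weighting scheme or data format.

```python
from statistics import mean, stdev

def z_normalise(scores):
    """Z-normalise scores from one annotation scheme (e.g. MQM, SQM, or DA)
    so that judgments on different scales become comparable."""
    mu, sigma = mean(scores), stdev(scores)
    return [(s - mu) / sigma for s in scores]

def make_long_context(examples, group_size=3):
    """Concatenate consecutive in-domain segments into one long-context
    example, scoring it with a weighted average of the segment scores.

    Assumption: each example is a dict with 'src', 'mt', 'ref', and a
    (normalised) 'score'; segments are weighted by hypothesis length.
    """
    augmented = []
    for i in range(0, len(examples) - group_size + 1, group_size):
        group = examples[i:i + group_size]
        src = " ".join(ex["src"] for ex in group)
        mt = " ".join(ex["mt"] for ex in group)
        ref = " ".join(ex["ref"] for ex in group)
        # Weight each segment's score by its hypothesis length in tokens.
        weights = [len(ex["mt"].split()) for ex in group]
        score = sum(w * ex["score"] for w, ex in zip(weights, group)) / sum(weights)
        augmented.append({"src": src, "mt": mt, "ref": ref, "score": score})
    return augmented
```

The resulting triples of concatenated source, hypothesis, and reference, paired with the aggregated scores, would then feed a COMET-style multilingual regression model as described in the abstract.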
Similar Papers
MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
Computation and Language
Helps computers judge the quality of writing in many languages.
COMET-poly: Machine Translation Metric Grounded in Other Candidates
Computation and Language
Scores computer translations more accurately by also looking at other candidate translations.
MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
Computation and Language
Scores computer translations and pinpoints their mistakes.