MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
By: Juraj Juraska, Tobias Domhan, Mara Finkelstein, and more
Potential Business Impact:
Makes computer translations better and finds mistakes.
In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. We further show that our decoder-only GemSpanEval model is competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
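The two systems described above lend themselves to short illustrative sketches. First, a minimal sketch of a regression-head quality estimation model in the spirit of MetricX-25: a pretrained multilingual backbone is pooled into a single predicted score and trained with a regression loss against gold MQM/ESA labels. The backbone checkpoint (xlm-roberta-base as a convenient stand-in), the input template, and the mean pooling are assumptions for illustration only; the actual system adapts Gemma 3 to an encoder-only architecture with its own input format and training protocol.

```python
# Minimal sketch (not the authors' code): pretrained backbone + regression head
# predicting a single quality score. Backbone, input template, and pooling are
# stand-in assumptions; the real MetricX-25 adapts Gemma 3 to an encoder-only
# architecture and uses the paper's own input format and training protocol.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RegressionQEModel(nn.Module):
    def __init__(self, backbone_name: str = "xlm-roberta-base"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.head = nn.Linear(self.backbone.config.hidden_size, 1)  # one score

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Mean-pool over non-padding tokens, then map to a scalar quality score.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        return self.head(pooled).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = RegressionQEModel()

# Hypothetical input template; the paper defines its own.
batch = tokenizer("source: Das ist ein Test. candidate: This is a test.",
                  return_tensors="pt", truncation=True)
score = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(score, torch.tensor([0.25]))  # regress toward a gold label
loss.backward()
```

Second, a sketch of why emitting the surrounding context with each predicted error span makes spans unambiguous in a generative formulation: if the erroneous substring occurs more than once in the translation, the context pins down which occurrence is meant. The JSON schema and field names below are assumptions, not the paper's exact output format.

```python
# Minimal sketch (assumed output schema, not the paper's exact format): the
# model emits each error span with its severity, category, and surrounding
# context, so a substring that appears multiple times can still be located.
import json

model_output = json.dumps([
    {
        "span": "bank",
        "context": "sat by the bank of the river",  # selects the second "bank"
        "severity": "major",
        "category": "accuracy/mistranslation",
    }
])

def locate_span(translation: str, prediction: dict) -> tuple[int, int]:
    """Find character offsets via the span's context, not the span alone."""
    ctx_start = translation.find(prediction["context"])
    if ctx_start == -1:
        raise ValueError("context not found in translation")
    start = ctx_start + prediction["context"].find(prediction["span"])
    return start, start + len(prediction["span"])

translation = "He put his money in the bank and then sat by the bank of the river."
for pred in json.loads(model_output):
    print(locate_span(translation, pred), pred["severity"], pred["category"])
```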
Similar Papers
Long-context Reference-based MT Quality Estimation
Computation and Language
Makes computer translations much better.
In2x at WMT25 Translation Task
Computation and Language
Helps computers translate rare languages well.
AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training
Computation and Language
Finds nice online talk to make things kinder.