When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content
By: Lydia Nishimwe, Benoît Sagot, Rachel Bawden
Potential Business Impact:
Helps computers translate messy online text better.
User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.
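The case study hinges on prompting LLMs with explicit, guideline-aware translation instructions. As a rough illustration only — the phenomenon names, action assignments, and prompt wording below are assumptions for the sketch, not the paper's actual taxonomy or prompts — here is one way such a prompt could be assembled from a dataset's guidelines:

```python
# Hypothetical sketch: build a guideline-aware translation prompt for UGC.
# Phenomenon names and action assignments are illustrative, not from the paper.

GUIDELINES = {
    "spelling errors": "NORMALISE",        # correct them in the translation
    "slang": "TRANSFER",                   # render with equivalent informal language
    "character repetitions": "TRANSFER",   # keep the expressive stretching in the target
    "emojis": "COPY",                      # keep them unchanged
    "usernames / handles": "COPY",
    "profanity": "CENSOR",                 # mask or soften, per dataset policy
}

def build_prompt(source_text: str, src_lang: str, tgt_lang: str,
                 guidelines: dict = GUIDELINES) -> str:
    """Assemble a prompt whose instructions mirror a dataset's annotation
    guidelines, so the model's output targets the same level of standardness
    as the reference translations."""
    rules = "\n".join(
        f"- {phenomenon}: {action}" for phenomenon, action in guidelines.items()
    )
    return (
        f"Translate the following {src_lang} user-generated text into {tgt_lang}.\n"
        f"Apply these per-phenomenon actions:\n{rules}\n\n"
        f"Text: {source_text}\n"
        f"Translation:"
    )

if __name__ == "__main__":
    print(build_prompt("omg c trooop drôle 😂😂", "French", "English"))
```

Changing the action assigned to a phenomenon (say, NORMALISE instead of TRANSFER for character repetitions) changes what a "correct" output looks like, which is why the authors argue that both models and evaluation metrics need to be aware of the guidelines behind the references.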
Similar Papers
An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Computation and Language
Helps computers translate without guessing gender.
Testing the Limits of Machine Translation from One Book
Computation and Language
Helps computers translate rare languages better.
The illusion of a perfect metric: Why evaluating AI's words is harder than it looks
Computation and Language
Helps AI write better by checking its work.