Pearmut: Human Evaluation of Translation Made Trivial
By: Vilém Zouhar, Tom Kocmi
Potential Business Impact:
Makes checking computer translations easy and fast.
Human evaluation is the gold standard for multilingual NLP, but it is often skipped in practice and substituted with automatic metrics because it is notoriously complex and slow to set up with existing tools, demanding substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and supports the evaluation of multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA (Direct Assessment), ESA (Error Span Annotation), and MQM (Multidimensional Quality Metrics), and is extensible to allow prototyping of new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESAAI pre-annotations, and both static and active-learning-based assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.
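To make the protocol names above concrete, the sketch below illustrates how MQM-style scores are commonly aggregated from annotated error spans: each error carries a severity, severities map to penalty weights, and a segment score is the negative weighted sum averaged over the test set. This is a minimal sketch for exposition only; the class names, function names, and severity weights are assumptions, not Pearmut's actual API or configuration.

```python
# Illustrative sketch of MQM-style score aggregation (not Pearmut's API).
# Severity-to-penalty weights are assumed; real campaigns configure their own.
from dataclasses import dataclass

SEVERITY_WEIGHTS = {"neutral": 0.0, "minor": 1.0, "major": 5.0, "critical": 10.0}

@dataclass
class ErrorSpan:
    start: int        # character offset of the error in the translation
    end: int
    category: str     # e.g. "accuracy/mistranslation", "fluency/grammar"
    severity: str     # one of SEVERITY_WEIGHTS

def mqm_segment_score(errors: list[ErrorSpan]) -> float:
    """Negative weighted error count: 0.0 means an error-free segment."""
    return -sum(SEVERITY_WEIGHTS[e.severity] for e in errors)

def system_score(segment_errors: list[list[ErrorSpan]]) -> float:
    """Average segment score across all evaluated segments of a system."""
    if not segment_errors:
        return 0.0
    return sum(mqm_segment_score(errs) for errs in segment_errors) / len(segment_errors)

# Example: one segment with a minor fluency error and a major accuracy error.
errors = [ErrorSpan(4, 9, "fluency/grammar", "minor"),
          ErrorSpan(15, 27, "accuracy/mistranslation", "major")]
print(mqm_segment_score(errors))   # -6.0
print(system_score([errors, []]))  # -3.0 (second segment is error-free)
```

By contrast, DA collects a single 0-100 adequacy score per segment, and ESA combines error-span marking with a final segment score, so their aggregation reduces to averaging the per-segment judgments rather than weighting error severities.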
Similar Papers
MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
Computation and Language
Helps computers judge good writing in many languages.
A Gamified Evaluation and Recruitment Platform for Low Resource Language Machine Translation Systems
Computation and Language
Helps translate rare languages better with games.
A Critical Study of Automatic Evaluation in Sign Language Translation
Computation and Language
Helps computers judge sign language videos better.