Extending Automatic Machine Translation Evaluation to Book-Length Documents
By: Kuang-Da Wang, Shuoyang Ding, Chao-Han Huck Yang, and more
Potential Business Impact:
Tests if computers translate whole books well.
Despite Large Language Models (LLMs) demonstrating superior translation performance and long-context capabilities, evaluation methodologies remain constrained to sentence-level assessment due to dataset limitations, token-count restrictions in metrics, and rigid sentence-boundary requirements. We introduce SEGALE, an evaluation scheme that extends existing automatic metrics to long-document translation by treating documents as continuous text and applying sentence segmentation and alignment methods. Our approach enables previously unattainable document-level evaluation, handling translations of arbitrary length generated with document-level prompts while accounting for under-/over-translations and varied sentence boundaries. Experiments show that our scheme significantly outperforms existing long-form document evaluation schemes and is comparable to evaluations performed with ground-truth sentence alignments. Additionally, we apply our scheme to book-length texts and demonstrate for the first time that many open-weight LLMs fail to effectively translate documents at their reported maximum context lengths.
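The abstract implies a simple pipeline: split the continuous reference and hypothesis documents into sentences, align hypothesis sentences to reference sentences, then score the aligned pairs with any sentence-level metric and aggregate. The sketch below illustrates only that general flow, not SEGALE itself; the regex segmenter, greedy SequenceMatcher alignment, search window, similarity threshold, and placeholder metric are all illustrative assumptions, and the over-translation penalty is omitted for brevity.

```python
import re
from difflib import SequenceMatcher


def split_sentences(document: str) -> list[str]:
    # Naive segmentation on sentence-final punctuation; a real pipeline
    # would use a trained sentence segmenter instead of a regex.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def align_sentences(ref_sents: list[str], hyp_sents: list[str]) -> list[tuple[str, str]]:
    # Greedy monotonic alignment by surface similarity within a small window.
    # A reference sentence with no plausible match is paired with an empty
    # string, so under-translation is penalized by the downstream metric.
    pairs, cursor = [], 0
    for ref in ref_sents:
        best_score, best_j = 0.0, None
        for j in range(cursor, min(cursor + 3, len(hyp_sents))):
            score = SequenceMatcher(None, ref, hyp_sents[j]).ratio()
            if score > best_score:
                best_score, best_j = score, j
        if best_j is not None and best_score > 0.3:  # illustrative threshold
            pairs.append((ref, hyp_sents[best_j]))
            cursor = best_j + 1
        else:
            pairs.append((ref, ""))
    return pairs


def document_score(reference_doc: str, hypothesis_doc: str, sentence_metric) -> float:
    # Segment, align, score each aligned pair, and average the results.
    ref_sents = split_sentences(reference_doc)
    hyp_sents = split_sentences(hypothesis_doc)
    pairs = align_sentences(ref_sents, hyp_sents)
    scores = [sentence_metric(ref, hyp) for ref, hyp in pairs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Placeholder metric: character-overlap ratio stands in for a learned
    # sentence-level metric such as COMET.
    overlap = lambda ref, hyp: SequenceMatcher(None, ref, hyp).ratio()
    ref = "The cat sat on the mat. It was warm. The dog left."
    hyp = "The cat sat on a mat. It was warm outside."
    print(round(document_score(ref, hyp, overlap), 3))
```

In this toy example the third reference sentence finds no hypothesis match and is paired with an empty string, so missing content lowers the document-level average, which is the behavior the under-translation handling is meant to capture.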
Similar Papers
Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models
Computation and Language
Helps computers judge long translations better.
Align-then-Slide: A complete evaluation framework for Ultra-Long Document-Level Machine Translation
Computation and Language
Checks if long translations are good.
Automatic Evaluation Metrics for Document-level Translation: Overview, Challenges and Trends
Computation and Language
Checks if computer translations are good.