
Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data

Published: April 30, 2025 | arXiv ID: 2504.21747v1

By: Maxime Bouthors, Josep Crego, François Yvon

Potential Business Impact:

Improves machine translation by retrieving example segments from text that exists only in the target language.

Business Areas:

Natural Language Processing, Artificial Intelligence, Data and Analytics, Software

Conventional retrieval-augmented neural machine translation (RANMT) systems leverage bilingual corpora, e.g., translation memories (TMs). Yet in many settings, in-domain monolingual target-side corpora are available. This work explores ways to take advantage of such resources by retrieving relevant segments directly in the target language, based on a source-side query. To this end, we design improved cross-lingual retrieval systems, trained with both sentence-level and word-level matching objectives. In our experiments with two RANMT architectures, we first demonstrate the benefits of such cross-lingual objectives in a controlled setting, obtaining translation performance that surpasses standard TM-based models. We then showcase our method in a real-world setup, where the target monolingual resources far exceed the amount of parallel data, and observe large improvements from our new techniques, which outperform both the baseline setting and general-purpose cross-lingual retrievers.
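
The retrieval step described in the abstract, a source-language query matched directly against target-language segments, can be illustrated with a minimal sketch. It assumes that source and target sentences have already been embedded into a shared vector space by some cross-lingual encoder. The 3-dimensional vectors, the example sentences, and the names `cosine`, `retrieve`, `TARGET_INDEX`, and `QUERY_VEC` are all hypothetical placeholders, not the authors' retriever or data. The sketch covers only nearest-neighbor lookup by cosine similarity; it does not implement the sentence-level and word-level training objectives.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(
    query_vec: Sequence[float],
    target_index: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[Tuple[float, str]]:
    """Return the k target-language segments most similar to the source query."""
    scored = [(cosine(query_vec, vec), segment) for segment, vec in target_index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical monolingual target-side (French) index with made-up embeddings.
TARGET_INDEX = [
    ("Le patient doit prendre le médicament deux fois par jour.", [0.9, 0.1, 0.0]),
    ("Le contrat prend fin le 31 décembre.", [0.0, 0.2, 0.9]),
    ("Prenez ce médicament avec de la nourriture.", [0.8, 0.3, 0.1]),
]

# Made-up embedding standing in for the English source query
# "Take the medication twice a day."
QUERY_VEC = [1.0, 0.2, 0.0]

top = retrieve(QUERY_VEC, TARGET_INDEX, k=2)
for score, segment in top:
    print(f"{score:.3f}  {segment}")
```

In a RANMT pipeline, the retrieved target-language segments would then be passed to the translation model as additional context. At realistic corpus sizes, the exhaustive scan above would typically be replaced by an approximate nearest-neighbor index.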