Boosting Data Utilization for Multilingual Dense Retrieval
By: Chao Huang, Fengran Mo, Yufeng Chen, and more
Potential Business Impact:
Finds information in any language, faster.
Multilingual dense retrieval aims to retrieve relevant documents across different languages with a single unified retriever model. The challenge lies in aligning the representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness depends heavily on the quality of the negative samples and the efficacy of the mini-batch data. Unlike existing studies that focus on developing sophisticated model architectures, we propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. Extensive experimental results on MIRACL, a multilingual retrieval benchmark covering 16 languages, demonstrate the effectiveness of our method, which outperforms several existing strong baselines.
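For context, the contrastive fine-tuning the abstract refers to is typically an InfoNCE-style objective where each query is pulled toward its relevant document and pushed away from in-batch negatives plus mined hard negatives. The sketch below illustrates this general setup in PyTorch; the function name, the temperature `tau`, and the one-hard-negative-per-query layout are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of contrastive fine-tuning with in-batch and hard negatives.
# All names here (contrastive_loss, tau) are illustrative assumptions, not the
# paper's method.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, tau=0.05):
    """InfoNCE loss over in-batch negatives plus mined hard negatives.

    q_emb:        (B, d) query embeddings
    pos_emb:      (B, d) embeddings of the relevant (positive) documents
    hard_neg_emb: (B, d) embeddings of one mined hard negative per query
    """
    # Query-to-positive similarities: the diagonal holds each query's true
    # pair; off-diagonal entries serve as in-batch negatives.
    sim_pos = q_emb @ pos_emb.T / tau        # (B, B)
    # Query-to-hard-negative similarities, all treated as negatives.
    sim_neg = q_emb @ hard_neg_emb.T / tau   # (B, B)
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (B, 2B)
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)

# Usage: the embeddings would come from one shared multilingual encoder, e.g.
#   q_emb = F.normalize(encoder(query_batch), dim=-1)
# so that queries and documents in different languages share one vector space.
```

Because every example in the batch doubles as a negative for every other query, the quality of both the mined hard negatives and the mini-batch composition directly shapes the training signal, which is the lever the paper's data-utilization method targets.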
Similar Papers
What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models
Information Retrieval
Finds information in any language, even rare ones.
Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
Computation and Language
Makes small computer models search languages better.
Multilingual Information Retrieval with a Monolingual Knowledge Base
Computation and Language
Helps computers find information in any language.