Boosting Data Utilization for Multilingual Dense Retrieval
By: Chao Huang, Fengran Mo, Yufeng Chen, and more
Potential Business Impact:
Finds information in any language, faster.
Multilingual dense retrieval aims to retrieve relevant documents across different languages with a single unified retriever model. The challenge lies in aligning the representations of different languages in a shared vector space. The common practice is to fine-tune the dense retriever via contrastive learning, whose effectiveness depends heavily on the quality of the negative samples and the efficacy of the mini-batch data. Unlike existing studies that focus on developing sophisticated model architectures, we propose a method to boost data utilization for multilingual dense retrieval by obtaining high-quality hard negative samples and effective mini-batch data. Extensive experimental results on MIRACL, a multilingual retrieval benchmark covering 16 languages, demonstrate the effectiveness of our method, which outperforms several existing strong baselines.
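For context, the contrastive fine-tuning the abstract refers to is typically an InfoNCE-style objective where each query is pulled toward its relevant document and pushed away from in-batch negatives plus mined hard negatives. The sketch below illustrates this general setup in PyTorch; the function name, the temperature `tau`, and the one-hard-negative-per-query layout are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of contrastive fine-tuning with in-batch and hard negatives.
# All names here (contrastive_loss, tau) are illustrative assumptions, not the
# paper's method.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb, pos_emb, hard_neg_emb, tau=0.05):
    """InfoNCE loss over in-batch negatives plus mined hard negatives.

    q_emb:        (B, d) query embeddings
    pos_emb:      (B, d) embeddings of the relevant (positive) documents
    hard_neg_emb: (B, d) embeddings of one mined hard negative per query
    """
    # Query-to-positive similarities: the diagonal holds each query's true
    # pair; off-diagonal entries serve as in-batch negatives.
    sim_pos = q_emb @ pos_emb.T / tau        # (B, B)
    # Query-to-hard-negative similarities, all treated as negatives.
    sim_neg = q_emb @ hard_neg_emb.T / tau   # (B, B)
    logits = torch.cat([sim_pos, sim_neg], dim=1)  # (B, 2B)
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(logits, labels)

# Usage: the embeddings would come from one shared multilingual encoder, e.g.
#   q_emb = F.normalize(encoder(query_batch), dim=-1)
# so that queries and documents in different languages share one vector space.
```

Because every example in the batch doubles as a negative for every other query, the quality of both the mined hard negatives and the mini-batch composition directly shapes the training signal, which is the lever the paper's data-utilization method targets.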
Similar Papers
What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models
Information Retrieval
Finds information in any language, even rare ones.
Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
Computation and Language
Makes small computer models search languages better.
Multilingual Information Retrieval with a Monolingual Knowledge Base
Computation and Language
Helps computers find information in any language.