Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG
By: Dayeon Ki, Marine Carpuat, Paul McNamee, and more
Potential Business Impact:
Computers sometimes cite English sources even when better ones exist in other languages.
Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open question is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and how that context influences citation behavior.
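One way to make the core measurement concrete: given a model response that cites documents from a mixed-language context, compare the share of citations each language receives against its share of the available documents. This is a minimal illustrative sketch, not the paper's actual methodology (which uses model internals); the document IDs, languages, and citation list below are hypothetical.

```python
from collections import Counter


def citation_language_rates(citations, doc_langs):
    """Return the fraction of citations going to each language.

    citations: list of cited document ids (repeats allowed).
    doc_langs: mapping from document id to its language code.
    """
    langs = [doc_langs[c] for c in citations]
    counts = Counter(langs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Hypothetical mixed-language context: 8 retrieved documents,
# half English, a quarter German, a quarter Hindi.
doc_langs = {1: "en", 2: "de", 3: "en", 4: "hi",
             5: "en", 6: "de", 7: "en", 8: "hi"}

# Hypothetical citations produced by a model (doc 1 cited twice).
citations = [1, 3, 5, 7, 2, 1]

rates = citation_language_rates(citations, doc_langs)
# English makes up 4/8 = 50% of the documents but receives
# 5/6 of the citations here, suggesting a language preference
# if relevance is held constant across languages.
print(rates)
```

A real study would control for relevance (as the paper does by holding it constant) before attributing such a skew to language preference rather than informativeness.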
Similar Papers
Investigating Language Preference of Multilingual RAG Systems
Computation and Language
Makes AI understand and answer in many languages.
Enhancing Multilingual RAG Systems with Debiased Language Preference-Guided Query Fusion
Computation and Language
Makes computers understand all languages equally well.
Multilingual Retrieval-Augmented Generation for Knowledge-Intensive Task
Computation and Language
Helps computers answer questions in any language.