Can Embedding Similarity Predict Cross-Lingual Transfer? A Systematic Study on African Languages
By: Tewodros Kederalah Idris, Prasenjit Mitra, Roald Eiselen
Potential Business Impact:
Helps computers learn many languages faster.
Cross-lingual transfer is essential for building NLP systems for low-resource African languages, but practitioners lack reliable methods for selecting source languages. We systematically evaluate five embedding similarity metrics across 816 transfer experiments spanning three NLP tasks, three African-centric multilingual models, and 12 languages from four language families. We find that cosine gap and retrieval-based metrics (P@1, CSLS) reliably predict transfer success (Spearman ρ = 0.4–0.6), while CKA shows negligible predictive power (ρ ≈ 0.1). Critically, correlation signs can reverse when results are pooled across models (Simpson's paradox), so practitioners must validate per model. Embedding metrics achieve predictive power comparable to URIEL linguistic typology. Our results provide concrete guidance for source-language selection and highlight the importance of model-specific analysis.
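The abstract's headline numbers are Spearman correlations between an embedding-similarity metric and downstream transfer performance. As a rough illustration only (not the authors' code), the sketch below computes a simple mean-cosine-similarity proxy over aligned sentence embeddings for a few hypothetical source languages and checks how well it ranks their transfer scores. All data and names here are illustrative assumptions; the paper's cosine gap, P@1, and CSLS metrics are defined over retrieval neighbourhoods rather than this plain average.

```python
# Minimal sketch: correlate an embedding-similarity proxy with transfer scores.
# Assumes row-aligned sentence embeddings for each source/target pair and a
# known transfer score per candidate source language (all values are toy data).
import numpy as np
from scipy.stats import spearmanr

def mean_cosine_similarity(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """Average cosine similarity between row-aligned embeddings,
    used here as a crude stand-in for a cross-lingual alignment metric."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(src * tgt, axis=1)))

# Toy setup: 3 candidate source languages, 50 aligned sentences, dim 768.
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 768))
transfer_scores = [0.62, 0.48, 0.71]          # hypothetical downstream F1 per source
similarities = []
for noise in (0.5, 1.5, 0.2):                 # smaller noise ~ better-aligned source
    source = target + noise * rng.normal(size=target.shape)
    similarities.append(mean_cosine_similarity(source, target))

# Spearman rho asks: does higher embedding similarity predict better transfer?
rho, p = spearmanr(similarities, transfer_scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```

In the paper's setting the same correlation is computed per model, which is what exposes the sign reversals that appear only when results are pooled across models.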
Similar Papers
Analyzing Similarity Metrics for Data Selection for Language Model Pretraining
Machine Learning (CS)
Helps computers learn better from text data.
Analyzing the Effect of Linguistic Similarity on Cross-Lingual Transfer: Tasks and Experimental Setups Matter
Computation and Language
Helps computers learn languages better, even rare ones.
What if I ask in alia lingua? Measuring Functional Similarity Across Languages
Computation and Language
Makes AI understand questions the same way in many languages.