Score: 1

Can Embedding Similarity Predict Cross-Lingual Transfer? A Systematic Study on African Languages

Published: January 6, 2026 | arXiv ID: 2601.03168v1

By: Tewodros Kederalah Idris, Prasenjit Mitra, Roald Eiselen

Potential Business Impact:

Helps teams pick the best source language when building NLP systems for low-resource languages, cutting the trial-and-error cost of multilingual development.

Business Areas:
Semantic Search, Internet Services

Cross-lingual transfer is essential for building NLP systems for low-resource African languages, but practitioners lack reliable methods for selecting source languages. We systematically evaluate five embedding similarity metrics across 816 transfer experiments spanning three NLP tasks, three African-centric multilingual models, and 12 languages from four language families. We find that cosine gap and retrieval-based metrics (P@1, CSLS) reliably predict transfer success ($\rho = 0.4$–$0.6$), while CKA shows negligible predictive power ($\rho \approx 0.1$). Critically, correlation signs reverse when pooling across models (Simpson's Paradox), so practitioners must validate per-model. Embedding metrics achieve comparable predictive power to URIEL linguistic typology. Our results provide concrete guidance for source language selection and highlight the importance of model-specific analysis.
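The paper itself does not include code here, but the retrieval-style metrics it evaluates can be sketched in a few lines of NumPy. The sketch below is illustrative only: the function names, the k=10 neighbourhood size, and the definition of cosine gap as aligned-minus-random mean cosine similarity are assumptions, not the authors' exact implementations. It assumes `src` and `tgt` are row-aligned sentence embeddings from a parallel evaluation set; CSLS follows the standard hubness-corrected retrieval formulation.

```python
import numpy as np

def cosine_gap(src, tgt):
    """Assumed definition: mean cosine similarity of aligned pairs minus
    mean similarity of misaligned pairs; higher = tighter alignment.
    Requires src and tgt to be row-aligned and the same size."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T
    aligned = np.trace(sims) / len(sims)
    misaligned = (sims.sum() - np.trace(sims)) / (sims.size - len(sims))
    return aligned - misaligned

def csls_p_at_1(src, tgt, k=10):
    """Precision@1 under CSLS retrieval: penalize hub vectors by
    subtracting each point's mean similarity to its k nearest
    cross-lingual neighbours before taking the argmax."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # avg sim to k-NN in tgt
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # avg sim to k-NN in src
    csls = 2 * sims - r_src[:, None] - r_tgt[None, :]
    preds = csls.argmax(axis=1)
    return float(np.mean(preds == np.arange(len(src))))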
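Given the sign reversals the paper reports when pooling across models (Simpson's Paradox), any such metric should be correlated with downstream transfer scores separately for each model, e.g. by running `scipy.stats.spearmanr` per model group rather than on the pooled experiments.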

Country of Origin
🇿🇦 🇺🇸 South Africa, United States

Page Count
13 pages

Category
Computer Science:
Computation and Language