Quantifying Language Disparities in Multilingual Large Language Models
By: Songbo Hu, Ivan Vulić, Anna Korhonen
Potential Business Impact:
Measures how fairly AI language models treat different languages, especially rare ones.
Results reported in large-scale multilingual evaluations are often fragmented and confounded by factors such as target languages, differences in experimental setups, and model choices. We propose a framework that disentangles these confounding variables and introduces three interpretable metrics: the performance realisation ratio, its coefficient of variation, and language potential, enabling a finer-grained and more insightful quantification of actual performance disparities across both (i) models and (ii) languages. Through a case study of 13 model variants on 11 multilingual datasets, we demonstrate that our framework provides a more reliable measurement of model performance and language disparities, particularly for low-resource languages, which have so far proven challenging to evaluate. Importantly, our results reveal that higher overall model performance does not necessarily imply greater fairness across languages.
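The metrics lend themselves to a simple computation once per-language scores are available. The sketch below is a hypothetical illustration rather than the paper's exact formulation: it assumes that language potential can be approximated by the best score any evaluated model attains on a language, that the performance realisation ratio is a model's observed score divided by that potential, and that the coefficient of variation of the ratios across languages summarises how evenly that potential is realised. All scores, model names, and aggregation choices are made up for demonstration.

```python
import numpy as np

# Hypothetical example scores: scores[model][language] = task score in [0, 1].
# These numbers are invented for illustration only.
scores = {
    "model_a": {"en": 0.82, "de": 0.78, "sw": 0.41, "yo": 0.33},
    "model_b": {"en": 0.88, "de": 0.80, "sw": 0.35, "yo": 0.28},
}

languages = sorted({lang for per_model in scores.values() for lang in per_model})

# Assumed proxy for "language potential": the best score any evaluated model
# reaches on that language (the paper may estimate potential differently).
potential = {
    lang: max(per_model[lang] for per_model in scores.values())
    for lang in languages
}

for model, per_model in scores.items():
    # Assumed performance realisation ratio: observed score relative to the
    # language's estimated potential, one value per language.
    ratios = np.array([per_model[lang] / potential[lang] for lang in languages])

    # Coefficient of variation of the ratios across languages: lower values
    # suggest the model realises its potential more evenly, i.e. more fairly.
    cv = ratios.std() / ratios.mean()

    print(f"{model}: mean realisation ratio = {ratios.mean():.3f}, CV = {cv:.3f}")
```

Under this reading, a model with a high mean realisation ratio but a high coefficient of variation performs well on average yet unevenly across languages, which is exactly the kind of disparity the abstract argues aggregate scores can hide.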
Similar Papers
Rethinking Cross-lingual Gaps from a Statistical Viewpoint
Computation and Language
Explains performance gaps between languages using statistical analysis.
Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results
Computation and Language
Shows how translation errors and evaluation choices distort multilingual results.
Objective Metrics for Evaluating Large Language Models Using External Data Sources
Computation and Language
Measures AI model performance objectively using external data sources.