Do we Need Dozens of Methods for Real World Missing Value Imputation?
By: Krystyna Grzesiak, Christophe Muller, Julie Josse and more
Potential Business Impact:
Finds better ways to fill in missing data.
Missing values pose a persistent challenge in modern data science. Consequently, there is an ever-growing number of publications introducing new imputation methods in various fields. While many studies compare imputation approaches, they often focus on a limited subset of algorithms and evaluate performance primarily through pointwise metrics such as RMSE, which are not suited to measuring how well the true data distribution is preserved. In this work, we provide a systematic benchmarking method based on the idea of treating imputation as a distributional prediction task. We consider a large number of algorithms and, for the first time, evaluate them not only on synthetic missingness mechanisms, but also on real-world missingness scenarios, using the concept of Imputation Scores. Finally, while the focus of previous benchmarks has often been on numerical data, we also consider mixed data sets in our study. The analysis overwhelmingly confirms the superiority of iterative imputation algorithms, especially the methods implemented in the mice R package.
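The abstract's point about pointwise metrics can be illustrated with a small simulation. The sketch below is not the paper's benchmark and does not compute Imputation Scores; it assumes a single standard-normal variable with values missing completely at random, and the function name `compare_imputations` and its parameters are hypothetical. It compares mean imputation with random draws from the observed values: the first tends to win on RMSE, while the second tends to preserve the spread of the data.

```python
import numpy as np


def compare_imputations(n=20000, missing_rate=0.3, seed=0):
    """Compare mean imputation and random-draw imputation on toy data.

    Assumptions (illustrative only): one standard-normal variable and
    values missing completely at random (MCAR).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=0.0, scale=1.0, size=n)
    mask = rng.random(n) < missing_rate  # True where the value is missing

    observed = x[~mask]
    truth = x[mask]

    # Imputation A: replace every missing entry by the observed mean.
    mean_imp = np.full(truth.shape, observed.mean())
    # Imputation B: replace each missing entry by a random observed value.
    draw_imp = rng.choice(observed, size=truth.shape[0], replace=True)

    def rmse(imputed):
        # Pointwise error against the held-out true values.
        return float(np.sqrt(np.mean((imputed - truth) ** 2)))

    def completed_std(imputed):
        # Standard deviation of the data set after filling in the gaps.
        filled = x.copy()
        filled[mask] = imputed
        return float(filled.std())

    return {
        "true_std": float(x.std()),
        "rmse_mean": rmse(mean_imp),
        "rmse_draw": rmse(draw_imp),
        "std_mean": completed_std(mean_imp),
        "std_draw": completed_std(draw_imp),
    }


if __name__ == "__main__":
    for name, value in compare_imputations().items():
        print(f"{name}: {value:.3f}")
```

In this setup mean imputation should report the lower RMSE while shrinking the standard deviation of the completed data, whereas random draws keep the spread close to the original at the cost of a higher RMSE. This is the kind of mismatch that motivates evaluating imputations as distributional predictions.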