Variable selection via knockoffs in missing data settings with categorical predictors

Published: August 8, 2025 | arXiv ID: 2508.06138v1

By: Silvia Bacci, Emanuela Dreassi, Leonardo Grilli and more

Potential Business Impact:

Finds important clues in messy student test data.

Large-scale assessment data typically include numerous categorical variables, often affected by missing values. Motivated by the challenges arising in this framework, we extend the knockoffs method for selecting predictors to settings with missing values. Our proposal relies on a preliminary phase consisting of multiple imputations of missing values. Each imputed dataset is then processed using a suitable knockoff filter. We evaluate the performance of the proposed method through a simulation study, showing satisfactory results consistent with a recently advocated cutting-edge method. We apply the method to large-scale assessment data collected by INVALSI about test scores of Italian students in grade 5 with many background variables. This case study is challenging, as most predictors have unordered categories, a setting not taken into account by traditional knockoffs methods. In addition, some of the key predictors are affected by missing values. The model includes random effects to account for the multilevel structure of students nested into schools. Our proposal to implement the knockoffs method within a multiple imputation framework proves to be feasible, flexible and effective.
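The two-phase idea described above (impute several times, then run a knockoff filter on each imputed dataset) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: it uses continuous Gaussian predictors rather than unordered categorical ones, a naive random hot-deck draw in place of a proper multiple-imputation procedure, equicorrelated Gaussian knockoffs, a ridge coefficient-difference statistic, no random effects, and a selection-frequency rule for combining the per-imputation results (the abstract does not state how results are combined). All function names are hypothetical.

```python
import numpy as np

def impute_hot_deck(X, rng):
    """Fill each NaN with a random observed value from the same column.
    Placeholder for a proper multiple-imputation method (assumption)."""
    X_imp = X.copy()
    for j in range(X.shape[1]):
        missing = np.isnan(X_imp[:, j])
        observed = X_imp[~missing, j]
        X_imp[missing, j] = rng.choice(observed, size=missing.sum())
    return X_imp

def gaussian_knockoffs(X, rng):
    """Equicorrelated Gaussian knockoffs built from the sample mean and covariance."""
    n, p = X.shape
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # S = s * I must satisfy 0 <= S <= 2 * Sigma; shrink slightly for stability.
    s = 0.99 * min(2.0 * np.linalg.eigvalsh(Sigma).min(), Sigma.diagonal().min())
    Sigma_inv = np.linalg.inv(Sigma)
    cond_mean = X - s * (X - mu) @ Sigma_inv
    cond_cov = 2.0 * s * np.eye(p) - s**2 * Sigma_inv
    L = np.linalg.cholesky(cond_cov + 1e-10 * np.eye(p))
    return cond_mean + rng.standard_normal((n, p)) @ L.T

def coef_difference(X, Xk, y, ridge=1.0):
    """W_j = |b_j| - |b_knockoff_j| from a ridge fit on [X, knockoffs]."""
    Z = np.hstack([X, Xk])
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    b = np.linalg.solve(Z.T @ Z + ridge * np.eye(Z.shape[1]), Z.T @ (y - y.mean()))
    p = X.shape[1]
    return np.abs(b[:p]) - np.abs(b[p:])

def knockoff_plus_select(W, q):
    """Knockoff+ rule: smallest t with (1 + #{W <= -t}) / max(1, #{W >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return W >= t
    return np.zeros(W.shape, dtype=bool)

def mi_knockoff_frequencies(X_missing, y, q=0.2, n_imputations=10, seed=0):
    """Share of imputed datasets in which each predictor is selected.
    The frequency-based aggregation is an assumption, not the paper's rule."""
    rng = np.random.default_rng(seed)
    freq = np.zeros(X_missing.shape[1])
    for _ in range(n_imputations):
        X_imp = impute_hot_deck(X_missing, rng)
        Xk = gaussian_knockoffs(X_imp, rng)
        freq += knockoff_plus_select(coef_difference(X_imp, Xk, y), q)
    return freq / n_imputations

# Synthetic example: 20 predictors, the first 8 truly related to y, 10% missing.
rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:8] = 1.0
y = X @ beta + rng.standard_normal(n)
X_missing = X.copy()
X_missing[rng.random((n, p)) < 0.1] = np.nan

freq = mi_knockoff_frequencies(X_missing, y)
selected = np.flatnonzero(freq >= 0.5)  # keep predictors chosen in at least half the runs
print(selected)
```

Note that thresholding selection frequencies at 0.5 is a heuristic and does not by itself carry over the false discovery rate guarantee of a single knockoff filter. With few predictors the knockoff+ rule can also select nothing, since it needs roughly 1/q selections before the estimated false discovery proportion can fall below q.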

Country of Origin
🇮🇹 Italy

Page Count
40 pages

Category
Statistics: Methodology