Novel Knockoff Generation and Importance Measures with Heterogeneous Data via Conditional Residuals and Local Gradients
By: Evan Mason, Zhe Fei
Potential Business Impact:
Finds important data in messy, mixed-up information.
Knockoff variable selection is a powerful framework that creates synthetic knockoff variables to mirror the correlation structure of the observed features, enabling principled control of the false discovery rate in variable selection. However, existing methods often assume homogeneous data types or known distributions, limiting their applicability in real-world settings with heterogeneous, distribution-free data. Moreover, common variable importance measures rely on linear outcome models, hindering their effectiveness for complex relationships. We propose a flexible knockoff generation framework based on conditional residuals that accommodates mixed data types without assuming known distributions. To assess variable importance, we introduce the Mean Absolute Local Derivative (MALD), an interpretable metric compatible with nonlinear outcome functions, including random forests and neural networks. Simulations show that our approach achieves better false discovery rate control and higher power than existing methods. We demonstrate its practical utility on a DNA methylation dataset from mouse tissues, identifying CpG sites linked to aging. Software is available in R (rangerKnockoff) and Python (MALDimportance).
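As an illustration of the idea behind a mean absolute local derivative, the sketch below averages the absolute central-difference derivative of a fitted prediction function over the observed samples, one feature at a time. The abstract does not give the exact MALD definition or say how local derivatives are estimated for non-smooth models such as random forests, so the function name `mald_sketch`, the step size `eps`, and the finite-difference estimator are assumptions made for illustration. This is not the API of the MALDimportance package.

```python
import numpy as np

def mald_sketch(predict, X, eps=1e-4):
    """Hypothetical helper: mean absolute central-difference derivative
    of `predict` with respect to each column of X.

    Assumes `predict` maps an (n, p) array to n predictions and is
    smooth in every feature; this is an illustrative estimator only.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        step = np.zeros(p)
        step[j] = eps  # perturb only feature j
        deriv = (predict(X + step) - predict(X - step)) / (2 * eps)
        scores[j] = np.mean(np.abs(deriv))  # average |local derivative| over samples
    return scores

# Toy nonlinear outcome function: feature 0 enters through sin,
# feature 1 linearly with slope 2, and feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
f = lambda Z: np.sin(Z[:, 0]) + 2.0 * Z[:, 1]

scores = mald_sketch(f, X)
print(scores)
```

In this toy setting the score for the unused third feature is zero and the score for the linear feature equals its slope, which is the sense in which such a measure stays interpretable when the outcome function is nonlinear.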