
Novel Knockoff Generation and Importance Measures with Heterogeneous Data via Conditional Residuals and Local Gradients

Published: August 20, 2025 | arXiv ID: 2508.14882v1

By: Evan Mason, Zhe Fei

Potential Business Impact:

Finds important data in messy, mixed-up information.

Business Areas:
A/B Testing, Data and Analytics

Knockoff variable selection is a powerful framework that creates synthetic knockoff variables to mirror the correlation structure of the observed features, enabling principled control of the false discovery rate in variable selection. However, existing methods often assume homogeneous data types or known distributions, limiting their applicability in real-world settings with heterogeneous, distribution-free data. Moreover, common variable importance measures rely on linear outcome models, hindering their effectiveness for complex relationships. We propose a flexible knockoff generation framework based on conditional residuals that accommodates mixed data types without assuming known distributions. To assess variable importance, we introduce the Mean Absolute Local Derivative (MALD), an interpretable metric compatible with nonlinear outcome functions, including random forests and neural networks. Simulations show that our approach achieves better false discovery rate control and higher power than existing methods. We demonstrate its practical utility on a DNA methylation dataset from mouse tissues, identifying CpG sites linked to aging. Software is available in R (rangerKnockoff) and Python (MALDimportance).
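As a rough illustration of the two ideas in the abstract, the sketch below computes a mean absolute local derivative by central finite differences and then applies the standard knockoff+ selection threshold to a vector of feature statistics. It is not the authors' implementation and does not use the rangerKnockoff or MALDimportance packages: the function names `mald` and `knockoff_plus_select`, the step size `eps`, the toy model, and the example statistics `W` are all assumptions made for this example. In particular, `W` is assumed to be "importance of the original feature minus importance of its knockoff", a common construction that may differ from the statistic used in the paper, and knockoff generation itself is not shown.

```python
import random


def mald(predict, rows, eps=1e-4):
    """Mean absolute local derivative of `predict` with respect to each feature.

    `predict` maps one row (a list of floats) to a float. Derivatives are
    approximated with central finite differences (an assumption of this sketch).
    """
    p = len(rows[0])
    totals = [0.0] * p
    for row in rows:
        for j in range(p):
            up = list(row)
            up[j] += eps
            down = list(row)
            down[j] -= eps
            local_derivative = (predict(up) - predict(down)) / (2.0 * eps)
            totals[j] += abs(local_derivative)
    return [total / len(rows) for total in totals]


def knockoff_plus_select(W, q):
    """Return indices selected by the knockoff+ threshold at target FDR level q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    """
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)
        positives = sum(1 for w in W if w >= t)
        if (1 + negatives) / max(1, positives) <= q:
            return [j for j, w in enumerate(W) if w >= t]
    return []  # no threshold meets the target level, so nothing is selected


def toy_model(row):
    # Hypothetical nonlinear outcome function: feature 2 is irrelevant.
    return 3.0 * row[0] + row[1] ** 2


random.seed(0)
demo_rows = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
demo_scores = mald(toy_model, demo_rows)

# Hypothetical feature statistics, for illustration only.
demo_W = [5, 4, 3, 2.5, 2, -0.5, 0.3, 6, 7, 8]
demo_selected = knockoff_plus_select(demo_W, q=0.2)
```

In the toy model the score for feature 0 is close to 3, the score for feature 1 is the average of |2x| over the sample, and the irrelevant feature scores 0. One caveat on the design choice: a fitted random forest is piecewise constant, so a very small finite-difference step would return zeros almost everywhere. The abstract does not say how local derivatives are obtained for such models, and this sketch makes no attempt to address it.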

Country of Origin
🇺🇸 United States

Page Count
27 pages

Category
Statistics:
Methodology