
Novel Knockoff Generation and Importance Measures with Heterogeneous Data via Conditional Residuals and Local Gradients

Published: August 20, 2025 | arXiv ID: 2508.14882v1

By: Evan Mason, Zhe Fei

Potential Business Impact:

Finds important data in messy, mixed-up information.

Business Areas:

A/B Testing, Data and Analytics

Knockoff variable selection is a powerful framework that creates synthetic knockoff variables to mirror the correlation structure of the observed features, enabling principled control of the false discovery rate in variable selection. However, existing methods often assume homogeneous data types or known distributions, limiting their applicability in real-world settings with heterogeneous, distribution-free data. Moreover, common variable importance measures rely on linear outcome models, hindering their effectiveness for complex relationships. We propose a flexible knockoff generation framework based on conditional residuals that accommodates mixed data types without assuming known distributions. To assess variable importance, we introduce the Mean Absolute Local Derivative (MALD), an interpretable metric compatible with nonlinear outcome functions, including random forests and neural networks. Simulations show that our approach achieves better false discovery rate control and higher power than existing methods. We demonstrate its practical utility on a DNA methylation dataset from mouse tissues, identifying CpG sites linked to aging. Software is available in R (rangerKnockoff) and Python (MALDimportance).
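As a rough illustration of the importance measure named in the abstract, the sketch below computes a mean absolute local derivative for each feature of a fitted model. It assumes that MALD for feature j is the average, over observations, of the absolute partial derivative of the prediction function with respect to that feature, and it approximates the derivative by central finite differences. The abstract does not give the exact definition or the estimator, so the function name `mald`, the `eps` step size, and the toy model are all hypothetical and do not reflect the API of the MALDimportance package.

```python
import numpy as np


def mald(predict, X, eps=1e-4):
    """Approximate mean absolute local derivative for each column of X.

    Assumption: importance of feature j is mean_i |d f(x_i) / d x_ij|,
    estimated here with central finite differences (not the paper's code).
    `predict` maps an (n, p) array to an (n,) array of predictions.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        step = np.zeros(p)
        step[j] = eps  # perturb only feature j
        # Central difference approximates the partial derivative at each row.
        grad_j = (predict(X + step) - predict(X - step)) / (2.0 * eps)
        scores[j] = np.mean(np.abs(grad_j))
    return scores


# Toy example: the model depends on features 0 and 1 and ignores feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))


def toy_model(Z):
    return np.sin(Z[:, 0]) + 2.0 * Z[:, 1]


scores = mald(toy_model, X)
print(scores)
```

In this toy setting the ignored feature receives a score of zero and the linear feature a score close to its slope. For piecewise-constant models such as tree ensembles, a very small step mostly yields zero differences, so a larger step or a different local-gradient estimate would be needed; the abstract does not say how the authors handle that case.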

Variable selection via knockoffs in missing data settings with categorical predictors

Methodology

Finds important clues in messy student test data.

8 Aug 2025 0

87%

DiffKnock: Diffusion-based Knockoff Statistics for Neural Networks Inference

Methodology

Finds important genes in cell data.

1 Oct 2025 0

87%

Knockoffs for low dimensions: changing the nominal level post-hoc to gain power while controlling the FDR

Methodology

Finds hidden patterns more reliably in data.

14 Nov 2025 1


Country of Origin

🇺🇸 United States

Page Count

27 pages
