Distributional Random Forests for Complex Survey Designs on Reproducing Kernel Hilbert Spaces
By: Yating Zou, Marcos Matabuena, Michael R. Kosorok
We study estimation of the conditional law $P(Y|X=\mathbf{x})$ and of continuous functionals $\Psi(P(Y|X=\mathbf{x}))$ when $Y$ takes values in a locally compact Polish space, $X \in \mathbb{R}^p$, and the observations arise from a complex survey design. We propose a survey-calibrated distributional random forest (SDRF) that incorporates complex-design features via a pseudo-population bootstrap, PSU-level honesty, and a Maximum Mean Discrepancy (MMD) split criterion computed from kernel mean embeddings of Hájek-type (design-weighted) node distributions. We provide a framework for analyzing forest-style estimators under survey designs, and we establish design consistency for the finite-population target and model consistency for the super-population target under explicit conditions on the design, kernel, resampling multipliers, and tree partitions. As far as we are aware, these are the first results on model-free estimation of conditional distributions under survey designs. Simulations under a stratified two-stage cluster design assess finite-sample performance and quantify the statistical cost of ignoring the survey design. We demonstrate the broad applicability of SDRF using NHANES: we estimate tolerance regions of the conditional joint distribution of two diabetes biomarkers, illustrating how distributional heterogeneity can support subgroup-specific risk profiling for diabetes mellitus in the U.S. population.
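To make the split criterion concrete, the following is a minimal sketch of a squared MMD between the design-weighted (Hájek-normalized) response distributions of two candidate child nodes. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, the biased V-statistic form, and the function names `hajek_weights` and `weighted_mmd2` are all hypothetical choices made here. An actual split score may also scale this quantity by the weight shares of the two children, which the sketch omits.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of a and b (assumed kernel)."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def hajek_weights(w):
    """Normalize survey weights so they sum to one (Hajek-type weights)."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def weighted_mmd2(y_left, w_left, y_right, w_right, bandwidth=1.0):
    """Squared MMD between two weighted empirical distributions.

    Each node distribution is embedded as sum_i w_i k(y_i, .), so the squared
    RKHS distance expands into three weighted kernel sums (biased V-statistic).
    """
    a = hajek_weights(w_left)
    b = hajek_weights(w_right)
    k_ll = gaussian_kernel(y_left, y_left, bandwidth)
    k_rr = gaussian_kernel(y_right, y_right, bandwidth)
    k_lr = gaussian_kernel(y_left, y_right, bandwidth)
    return float(a @ k_ll @ a + b @ k_rr @ b - 2.0 * (a @ k_lr @ b))

# Toy bivariate responses (e.g. two biomarkers) with unequal survey weights.
y_left = np.array([[0.0, 0.0], [1.0, 1.0], [0.5, 0.2]])
w_left = np.array([1.0, 3.0, 2.0])
y_right = y_left + 5.0
w_right = np.array([2.0, 2.0, 1.0])

print(weighted_mmd2(y_left, w_left, y_right, w_right))
```

Because the weights are normalized within each node, rescaling all survey weights in a node by a constant leaves the criterion unchanged; only the relative weights matter.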