
Debiased machine learning for combining probability and non-probability survey data

Published: August 12, 2025 | arXiv ID: 2508.08948v1

By: Shaun Seaman

Potential Business Impact:

Improves survey estimates by combining a nonprobability sample with a probability sample.

We consider the problem of estimating the finite population mean $\bar{Y}$ of an outcome variable $Y$ using data from a nonprobability sample and auxiliary information from a probability sample. Existing double robust (DR) estimators of this mean $\bar{Y}$ require the estimation of two nuisance functions: the conditional probability of selection into the nonprobability sample given covariates $X$ that are observed in both samples, and the conditional expectation of $Y$ given $X$. These nuisance functions can be estimated using parametric models, but the resulting estimator of $\bar{Y}$ will typically be biased if both parametric models are misspecified. It would therefore be advantageous to use more flexible data-adaptive / machine-learning estimators of the nuisance functions. Here, we develop a general framework for the valid use of DR estimators of $\bar{Y}$ when the design of the probability sample uses sampling without replacement at the first stage and data-adaptive / machine-learning estimators are used for the nuisance functions. We prove that several DR estimators of $\bar{Y}$, including targeted maximum likelihood estimators, are asymptotically normally distributed when the estimators of the nuisance functions converge at a rate faster than $n^{-1/4}$ and cross-fitting is used. We present a simulation study that demonstrates good performance of these DR estimators compared to the corresponding DR estimators that rely on at least one correctly specified parametric model.
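
To make the structure of a DR estimator concrete, the following is a minimal sketch of one common form from the data-integration literature, not necessarily the exact estimators studied in the paper. It assumes the two nuisance functions have already been estimated (ideally by cross-fitted learners) and are passed in as predictions. The function name `dr_mean` and all argument names are hypothetical. The estimate is the design-weighted mean of the outcome predictions over the probability sample, plus an inverse-probability-weighted mean of the residuals over the nonprobability sample.

```python
import numpy as np


def dr_mean(y_np, m_np, pi_np, m_p, d_p):
    """Doubly robust estimate of a finite population mean (illustrative sketch).

    y_np  : outcomes observed in the nonprobability sample
    m_np  : predicted E[Y | X] for the nonprobability-sample units
    pi_np : predicted selection probabilities for those same units
    m_p   : predicted E[Y | X] for the probability-sample units
    d_p   : design weights of the probability-sample units
    """
    y_np, m_np, pi_np = (np.asarray(a, dtype=float) for a in (y_np, m_np, pi_np))
    m_p, d_p = (np.asarray(a, dtype=float) for a in (m_p, d_p))

    # Inverse-probability weights for the nonprobability sample.
    w_np = 1.0 / pi_np

    # Weighted mean of outcome-model residuals (normalised by the sum of weights).
    correction = np.sum(w_np * (y_np - m_np)) / np.sum(w_np)

    # Design-weighted mean of outcome predictions over the probability sample.
    projection = np.sum(d_p * m_p) / np.sum(d_p)

    return correction + projection


# Usage: with a perfect outcome model the residual term vanishes, so the
# estimate equals the design-weighted mean of the predictions.
estimate = dr_mean(
    y_np=[1.0, 2.0, 3.0],
    m_np=[1.0, 2.0, 3.0],
    pi_np=[0.5, 0.5, 0.5],
    m_p=[2.0, 4.0],
    d_p=[1.0, 3.0],
)
print(estimate)  # 3.5
```

In this sketch both terms are normalised by estimated population sizes (the sums of the weights), which is one of several conventions; cross-fitting would enter upstream, by predicting `m_np`, `m_p`, and `pi_np` for each unit from models fitted without that unit's fold.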