Semiparametric Efficient Data Integration Using the Dual-Frame Sampling Framework
By: Kosuke Morikawa, Jae Kwang Kim
Integrating probability and non-probability samples is increasingly important, yet unknown sampling mechanisms in non-probability sources complicate identification and efficient estimation. We develop semiparametric theory for dual-frame data integration and propose two complementary estimators. The first models the non-probability inclusion probability parametrically and attains the semiparametric efficiency bound. We introduce an identifiability condition based on strong monotonicity that identifies sampling-model parameters without instrumental variables, even under informative (non-ignorable) selection, using auxiliary information from the probability sample; it remains valid without record linkage between samples. The second estimator, motivated by a two-stage sampling approximation, avoids explicit modeling of the non-probability mechanism; though not fully efficient, it is efficient within a restricted augmentation class and is robust to misspecification. Simulations and an application to the Culture and Community in a Time of Crisis public simulation dataset show efficiency gains under correct specification and stable performance under misspecification and weak identification. Methods are implemented in the R package dfSEDI.