An optimal two-step estimation approach for two-phase studies

Published: October 13, 2025 | arXiv ID: 2510.11587v1

By: Qingning Zhou, Kin Yau Wong

Potential Business Impact:

Improves how scientists learn from incomplete data.

Business Areas:

A/B Testing, Data and Analytics

Two-phase sampling is commonly adopted for reducing cost and improving estimation efficiency. In many two-phase studies, the outcome and some cheap covariates are observed for a large sample in Phase I, and expensive covariates are obtained for a selected subset of the sample in Phase II. As a result, the analysis of the association between the outcome and covariates faces a missing data problem. Complete-case analysis, which relies solely on the Phase II sample, is generally inefficient. In this paper, we study a two-step estimation approach, which first obtains an estimator using the complete data, and then updates it using an asymptotically mean-zero estimator obtained from a working model between the outcome and cheap covariates using the full data. This two-step estimator is asymptotically at least as efficient as the complete-data estimator and is robust to misspecification of the working model. We propose a kernel-based method to construct a two-step estimator that achieves optimal efficiency. Additionally, we develop a simple joint update approach based on multiple working models to approximate the optimal estimator when a fully nonparametric kernel approach is infeasible. We illustrate the proposed methods with various outcome models. We demonstrate their advantages over existing approaches through simulation studies and provide an application to a major cancer genomics study.
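The two-step idea in the abstract can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: it assumes a linear outcome model, a single linear working model on the cheap covariate, and a Phase II subset drawn by simple random sampling, and it uses a standard influence-function update rather than the kernel-based optimal estimator. The names `ols_with_influence`, `two_step_estimate`, and `simulate`, and all simulation settings, are hypothetical.

```python
import numpy as np


def ols_with_influence(X, y):
    """OLS fit; returns coefficients and an (n, p) array of influence values."""
    n = len(y)
    bread_inv = np.linalg.inv(X.T @ X / n)
    coef = bread_inv @ (X.T @ y / n)
    resid = y - X @ coef
    influence = (X * resid[:, None]) @ bread_inv  # row i: bread_inv @ (X_i * resid_i)
    return coef, influence


def two_step_estimate(y, z, x, in_phase2):
    """Complete-case fit of y ~ 1 + z + x, updated via the working model y ~ 1 + z.

    Assumption: Phase II is a simple random subsample, so the working-model
    estimates from Phase II and from the full sample share the same limit.
    x is only read where in_phase2 is True.
    """
    W_full = np.column_stack([np.ones_like(y), z])   # cheap covariates, all subjects
    W_cc = W_full[in_phase2]
    X_cc = np.column_stack([W_cc, x[in_phase2]])     # adds the expensive covariate
    y_cc = y[in_phase2]

    # Step 1: complete-case estimator of the outcome model.
    beta_cc, phi = ols_with_influence(X_cc, y_cc)

    # Step 2: working model fitted on Phase II and on the full Phase I sample.
    gamma_cc, psi = ols_with_influence(W_cc, y_cc)
    gamma_full, _ = ols_with_influence(W_full, y)
    diff = gamma_cc - gamma_full                     # asymptotically mean zero

    # Project the complete-case estimator onto the mean-zero difference.
    n2 = len(y_cc)
    cov_phi_psi = phi.T @ psi / n2
    var_psi = psi.T @ psi / n2
    beta_updated = beta_cc - cov_phi_psi @ np.linalg.solve(var_psi, diff)
    return beta_cc, beta_updated


def simulate(n=1000, n2=200, seed=0):
    """Hypothetical data: z is cheap, x is expensive and seen only in Phase II."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    x = 0.5 * z + rng.normal(size=n)
    y = 1.0 + 0.5 * z + 0.5 * x + rng.normal(size=n)
    in_phase2 = np.zeros(n, dtype=bool)
    in_phase2[rng.choice(n, size=n2, replace=False)] = True
    x_obs = np.where(in_phase2, x, np.nan)           # missing outside Phase II
    return y, z, x_obs, in_phase2


if __name__ == "__main__":
    beta_cc, beta_updated = two_step_estimate(*simulate())
    print("complete-case:", beta_cc)
    print("two-step:     ", beta_updated)
```

Under simple random sampling the sampling-fraction factors cancel in the projection, which is why only the two influence-function moment matrices appear in the update. In a linear, homoscedastic setup like this one, any efficiency gain would be expected mainly in the intercept and the coefficient of the cheap covariate.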

Design and Analysis Considerations for Causal Inference under Two-Phase Sampling in Observational Studies

Methodology

Makes surveys more accurate with less money.

10 Nov 2025

Efficient and Intuitive Two-Phase Validation Across Multiple Models via Principal Components

Methodology

Finds the best people to check data.

1 Dec 2025

Scalable and Efficient Multiple Imputation for Case-Cohort Studies via Influence Function-Based Supersampling

Methodology

Makes expensive health tests faster and cheaper.

18 Nov 2025

Country of Origin

🇭🇰 Hong Kong

Page Count

21 pages
