Training and Testing with Multiple Splits: A Central Limit Theorem for Split-Sample Estimators
By: Bruno Fava
Potential Business Impact:
Improves machine learning by making smarter use of data.
As predictive algorithms grow in popularity, using the same dataset to both train and test a new model has become routine across research, policy, and industry. Sample-splitting attains valid inference on model properties by using separate subsamples to estimate the model and to evaluate it. However, this approach has two drawbacks: each task uses only part of the data, and different splits can lead to widely different estimates. By averaging across multiple splits, I develop an inference approach that uses more data for training, uses the entire sample for testing, and improves reproducibility. I address the statistical dependence from reusing observations across splits by proving a new central limit theorem for a large class of split-sample estimators under arguably mild and general conditions. Importantly, I place no restrictions on model complexity or convergence rates. I show that confidence intervals based on the normal approximation are valid for many applications, but may undercover in important cases of interest, such as comparing the performance of two models. I develop a new inference approach for such cases that explicitly accounts for the dependence across splits. Moreover, I provide a measure of reproducibility for p-values obtained from split-sample estimators. Finally, I apply my results to two important problems in development and public economics: predicting poverty and learning heterogeneous treatment effects in randomized experiments. I show that my inference approach with repeated cross-fitting achieves better power than previous alternatives, often enough to find statistical significance that would otherwise be missed.
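The repeated cross-fitting idea in the abstract can be sketched in code: split the sample into folds, train on all but one fold, evaluate on the held-out fold so every observation is used for testing, then average over several random splits to reduce split-to-split variability. The sketch below is illustrative only, assuming a generic per-observation loss; the `fit` and `loss` callables and all names are hypothetical, not the paper's actual estimator or notation.

```python
import numpy as np

def repeated_cross_fit(X, y, fit, loss, n_splits=5, n_repeats=10, seed=0):
    """Average a split-sample loss estimate over repeated K-fold splits.

    fit(X_train, y_train) -> fitted model with a .predict method
    loss(y_true, y_pred) -> array of per-observation losses
    Returns the loss averaged over all folds and all random repetitions.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(n_repeats):
        # A fresh random partition of the sample into K folds.
        perm = rng.permutation(n)
        folds = np.array_split(perm, n_splits)
        per_obs = np.empty(n)
        for k in range(n_splits):
            test_idx = folds[k]
            train_idx = np.concatenate(
                [folds[j] for j in range(n_splits) if j != k]
            )
            # Train on K-1 folds, evaluate on the held-out fold,
            # so the full sample is eventually used for testing.
            model = fit(X[train_idx], y[train_idx])
            per_obs[test_idx] = loss(y[test_idx], model.predict(X[test_idx]))
        estimates.append(per_obs.mean())
    # Averaging across repetitions improves reproducibility, but the
    # estimates share observations, which is the dependence the paper's
    # central limit theorem addresses.
    return float(np.mean(estimates))
```

Note that a naive standard error treating the repetitions as independent would be too small, since every repetition reuses the same observations; this is precisely the dependence the paper's inference approach accounts for.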
Similar Papers
A Honest Cross-Validation Estimator for Prediction Performance
Machine Learning (Stat)
Improves how well computer predictions work.
Inference for Forecasting Accuracy: Pooled versus Individual Estimators in High-dimensional Panel Data
Methodology
Helps choose the best way to study groups of people.
CLT in high-dimensional Bayesian linear regression with low SNR
Statistics Theory
Helps understand data when signals are weak.