Efficient Data Reduction Via PCA-Guided Quantile Based Sampling
By: Foo Hui-Mean, Yuan-chin Ivan Chang
Potential Business Impact:
Makes computer models learn better from less data.
In large-scale statistical modeling, reducing data size through subsampling is essential for balancing computational efficiency and statistical accuracy. We propose a new method, Principal Component Analysis guided Quantile Sampling (PCA-QS), which projects data onto principal components and applies quantile-based sampling to retain representative and diverse subsets. Compared with uniform random sampling, leverage score sampling, and coreset methods, PCA-QS consistently achieves lower mean squared error and better preservation of key data characteristics, while also being computationally efficient. This approach is adaptable to a variety of data scenarios and shows strong potential for broad applications in statistical computing.
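As a rough illustration of the idea described in the abstract (project onto principal components, then sample by quantile), the following minimal sketch may help. It is not the authors' algorithm: the abstract does not specify how quantiles are formed or how points are drawn. The function name `pca_quantile_sample`, its parameters (`n_components`, `n_bins`, `seed`), and the choice to draw roughly equal numbers of points from each joint quantile cell are all assumptions made for illustration.

```python
import numpy as np

def pca_quantile_sample(X, n_samples, n_components=2, n_bins=4, seed=0):
    """Hypothetical sketch: return row indices of a PCA-guided quantile subsample."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Center the data and project it onto the leading principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T

    # Assign every row to a quantile bin along each component, and combine
    # the per-component bins into a single joint cell label (assumption).
    inner = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    cell = np.zeros(len(X), dtype=np.int64)
    for j in range(scores.shape[1]):
        edges = np.quantile(scores[:, j], inner)
        bins = np.searchsorted(edges, scores[:, j], side="right")
        cell = cell * n_bins + bins

    # Draw about the same number of rows from each non-empty cell, so that
    # sparse regions of the projected space are still represented.
    cells = np.unique(cell)
    per_cell = max(1, n_samples // len(cells))
    chosen = []
    for c in cells:
        idx = np.flatnonzero(cell == c)
        take = min(per_cell, len(idx))
        chosen.append(rng.choice(idx, size=take, replace=False))
    return np.sort(np.concatenate(chosen))


# Usage on synthetic data (illustrative only).
X_demo = np.random.default_rng(1).normal(size=(1000, 5))
subset_idx = pca_quantile_sample(X_demo, n_samples=100)
X_subset = X_demo[subset_idx]
```

In this sketch the returned subset only approximately matches `n_samples`, because the budget is split evenly across cells and rounded down; it can exceed `n_samples` when there are more non-empty cells than requested points. A faithful implementation would follow the allocation rule given in the paper itself.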
Similar Papers
Efficient and Intuitive Two-Phase Validation Across Multiple Models via Principal Components
Methodology
Finds the best people to check data.
Low-Precision Streaming PCA
Machine Learning (CS)
Makes computers learn faster with less memory.
Estimating the true number of principal components under the random design
Econometrics
Finds the best way to simplify complex data.