Principal Component Analysis When n < p: Challenges and Solutions
By: Nuwan Weeraratne, Lyn Hunt, Jason Kurz
Potential Business Impact:
Makes data analysis more reliable when a dataset has more variables than observations.
Principal component analysis (PCA) is a key technique for reducing the complexity of high-dimensional data while preserving its fundamental structure, which helps models remain stable and interpretable. It transforms the original variables into a new set of uncorrelated variables (principal components) based on the covariance structure of the original variables. However, because the traditional maximum likelihood covariance estimator does not converge accurately to the true covariance matrix, standard PCA performs poorly as a dimensionality reduction technique in high-dimensional settings where $n<p$. In this study, inspired by a fundamental issue with mean estimation when $n<p$, we propose a novel estimator, pairwise differences covariance estimation, together with four regularized versions of it, to address the problems PCA faces in high-dimensional settings with $n<p$. In empirical comparisons with existing methods (maximum likelihood estimation and its best alternative, Ledoit-Wolf estimation), all the proposed regularized versions of pairwise differences covariance estimation perform well in estimating the covariance and principal components while minimizing the PCs' overdispersion and cosine similarity error. Real data applications are presented.
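The abstract does not state the estimator's formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the classical pairwise-difference form of the covariance, $S = \frac{1}{2n(n-1)} \sum_{i,j} (x_i - x_j)(x_i - x_j)^\top$, which needs no explicit mean estimate and is algebraically equal to the unbiased sample covariance. The regularization shown is a plain linear shrinkage toward a scaled identity with a hand-picked intensity `alpha`. That is a hypothetical stand-in for illustration, not one of the four regularized versions proposed in the paper and not the Ledoit-Wolf optimal intensity. The function names are likewise illustrative.

```python
import numpy as np

def pairwise_difference_cov(X):
    """Covariance built from pairwise differences (assumed classical form).

    S = 1 / (2 n (n - 1)) * sum over all i, j of (x_i - x_j)(x_i - x_j)^T
    No sample mean is computed explicitly.
    """
    n = X.shape[0]
    diffs = X[:, None, :] - X[None, :, :]      # shape (n, n, p)
    flat = diffs.reshape(n * n, -1)            # one row per ordered pair
    return flat.T @ flat / (2.0 * n * (n - 1))

def shrink_to_identity(S, alpha):
    """Hypothetical regularizer: convex mix of S and a scaled identity."""
    p = S.shape[0]
    mu = np.trace(S) / p                       # average eigenvalue of S
    return (1.0 - alpha) * S + alpha * mu * np.eye(p)

def principal_components(S, k):
    """Top-k eigenvalues and eigenvectors of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(S)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vals[order], vecs[:, order]

# Simulated data with fewer observations than variables (n < p).
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))

S_pd = pairwise_difference_cov(X)
S_unbiased = np.cov(X, rowvar=False)           # divides by n - 1
rank = np.linalg.matrix_rank(S_pd)             # at most n - 1, so S_pd is singular

S_reg = shrink_to_identity(S_pd, alpha=0.2)    # alpha chosen arbitrarily
top_vals, top_vecs = principal_components(S_reg, k=3)
```

With $n<p$ the unregularized estimate has rank at most $n-1$, so it is singular and its smallest eigenvalues are exactly zero. Shrinkage lifts every eigenvalue by at least `alpha * mu`, which makes the matrix positive definite before the eigendecomposition is taken.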
Similar Papers
Highly robust factored principal component analysis for matrix-valued outlier accommodation and explainable detection via matrix minimum covariance determinant
Methodology
Finds bad data points in complex pictures.
Beyond Regularization: Inherently Sparse Principal Component Analysis
Methodology
Finds hidden patterns in complex information.
Large-dimensional Factor Analysis with Weighted PCA
Methodology
Improves computer analysis of complex data.