19 January 2016 | Ian T. Jolliffe and Jorge Cadima
Principal component analysis (PCA) is a statistical technique used to reduce the dimensionality of large datasets while preserving as much variability as possible. It creates new uncorrelated variables, called principal components (PCs), that successively maximize variance. PCA is adaptive because it is data-driven, not based on predefined basis functions, and has been adapted for various data types and structures.
The method reduces to solving an eigenvalue/eigenvector problem for the covariance or correlation matrix, or equivalently to a singular value decomposition (SVD) of the column-centred data matrix. The covariance matrix suits variables measured in the same units on comparable scales; the correlation matrix, which amounts to standardizing each variable first, is usually preferred when units or scales differ. The first few PCs are typically used for interpretation, while later ones may be relevant in specific contexts such as outlier detection or image analysis. In atmospheric science, PCA is known as empirical orthogonal function (EOF) analysis. Recent developments include robust PCA (RPCA), which addresses the sensitivity of PCA to outliers by decomposing the data into low-rank and sparse components. Other adaptations simplify the components for better interpretability: SCoTLASS, for example, uses LASSO-like constraints to drive some loadings to zero, resulting in sparse components. These adaptations highlight the versatility of PCA in handling diverse data types and improving interpretability while maintaining the core goal of dimensionality reduction.
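The two computational routes mentioned above (an eigendecomposition of the covariance or correlation matrix, and an SVD of the centred data matrix) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not code from the article: the function names `pca_eig` and `pca_svd`, the `use_correlation` flag, and the synthetic data are all assumptions made for the example.

```python
import numpy as np


def _prepare(X, use_correlation):
    """Centre the columns; optionally scale them to unit sample variance."""
    Xc = X - X.mean(axis=0)
    if use_correlation:
        # Standardizing makes the covariance of Xc equal the correlation of X.
        Xc = Xc / Xc.std(axis=0, ddof=1)
    return Xc


def pca_eig(X, use_correlation=False):
    """PCA via eigendecomposition (hypothetical helper for illustration)."""
    Xc = _prepare(X, use_correlation)
    n = Xc.shape[0]
    S = (Xc.T @ Xc) / (n - 1)             # sample covariance/correlation matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]     # reorder so PC1 has largest variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                 # PC scores (the new variables)
    return eigvals, eigvecs, scores


def pca_svd(X, use_correlation=False):
    """PCA via SVD of the centred data matrix (hypothetical helper)."""
    Xc = _prepare(X, use_correlation)
    n = Xc.shape[0]
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    eigvals = s**2 / (n - 1)              # PC variances from singular values
    return eigvals, Vt.T, U * s           # loadings are columns of V; scores = U * s


# Synthetic correlated data, assumed purely for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

ev_eig, load_eig, scores = pca_eig(X)
ev_svd, load_svd, _ = pca_svd(X)
ev_corr, _, _ = pca_eig(X, use_correlation=True)

print("PC variances (eig):", ev_eig)
print("PC variances (SVD):", ev_svd)
print("Share of variance per PC:", ev_eig / ev_eig.sum())
```

Both routes should give the same PC variances, and loadings that agree up to the sign of each column, since an eigenvector is only defined up to sign. With `use_correlation=True` the PC variances sum to the number of variables, because every standardized variable has unit variance.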