pcaMethods - a bioconductor package providing PCA methods for incomplete data

The pcaMethods package is a Bioconductor-compliant library for performing principal component analysis (PCA) on incomplete data. The PCA results can be analyzed directly, or the methods can be used to estimate missing values so that statistical methods sensitive to missing values can then be applied. The package was developed primarily for microarray and metabolite data but can be applied to any incomplete dataset.

The package includes several PCA methods that are robust to missing data and allow missing value estimation: Probabilistic PCA (PPCA), Bayesian PCA (BPCA), Inverse Non-linear PCA (NLPCA), NIPALS PCA, SVDimpute, and LLSimpute. PPCA is the fastest method and is recommended for large datasets. BPCA provides the best missing value estimation accuracy on average, while NLPCA may be superior when the data contain strong non-linear dependencies. SVDimpute and NIPALS are widely used standard approaches.

PCA requires mean centering, because it is based on the calculation of the covariance matrix. Variance normalization can affect PCA results, and scaling to unit variance may be useful when comparing variables that have different units or intensity ranges. For microarray data, however, genes that are not expressed at all must be removed before scaling to avoid introducing unnecessary noise.

Missing value estimation is crucial for subsequent statistical analyses that cannot handle incomplete data. The package supports it through BPCA, SVDimpute, NLPCA, and LLSimpute, among the methods listed above. The optimal number of principal components, or of neighbors for the neighbor-based methods, can be determined by cross-validation.

The pcaMethods package is written in R and is part of the Bioconductor suite of packages. It runs in the R environment and can be integrated with web-based tools such as MetaGeneAlyse for visualization and analysis of large-scale transcript and metabolite data. All methods return a common object type, pcaRes, for maximum interoperability.
The package is available at http://www.bioconductor.org.
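
To make the idea of PCA on incomplete data concrete, the following is a minimal sketch of a NIPALS-style iteration that skips missing entries. It is written in plain Python for illustration only. It is not the pcaMethods implementation and does not use the package's R interface. The function names (`column_means`, `nipals_component`, `impute_rank_one`), the use of `None` as the missing-value marker, and the toy data are all assumptions made for this example. The sketch centers each column on its observed values, extracts one component, and fills each gap with the reconstructed value.

```python
import math

def column_means(X):
    """Mean of each column, computed over observed (non-None) entries only."""
    means = []
    for j in range(len(X[0])):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return means

def nipals_component(Xc, max_iter=500, tol=1e-12):
    """First component of centered data Xc, skipping missing entries.

    Returns (scores, loadings). Assumes every row and column has at
    least one observed value and that the first column is not all zero.
    """
    n, p = len(Xc), len(Xc[0])
    # Initial scores: the first column, with missing entries set to zero.
    t = [row[0] if row[0] is not None else 0.0 for row in Xc]
    loadings = [0.0] * p
    for _ in range(max_iter):
        # Loadings: regress each column on the scores, observed rows only.
        for j in range(p):
            rows = [i for i in range(n) if Xc[i][j] is not None]
            num = sum(Xc[i][j] * t[i] for i in rows)
            den = sum(t[i] ** 2 for i in rows)
            loadings[j] = num / den
        norm = math.sqrt(sum(v * v for v in loadings))
        loadings = [v / norm for v in loadings]
        # Scores: regress each row on the loadings, observed columns only.
        t_new = []
        for i in range(n):
            cols = [j for j in range(p) if Xc[i][j] is not None]
            num = sum(Xc[i][j] * loadings[j] for j in cols)
            den = sum(loadings[j] ** 2 for j in cols)
            t_new.append(num / den)
        change = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if change <= tol * sum(v * v for v in t):
            break
    return t, loadings

def impute_rank_one(X):
    """Fill missing entries with a one-component reconstruction."""
    means = column_means(X)
    Xc = [[None if v is None else v - m for v, m in zip(row, means)]
          for row in X]
    t, loadings = nipals_component(Xc)
    return [[means[j] + t[i] * loadings[j] if v is None else v
             for j, v in enumerate(row)]
            for i, row in enumerate(X)]

# Toy data: column 2 is twice column 1, column 3 is 3 minus column 1.
# Two entries of column 2 (true values 2 and 10) are withheld.
data = [
    [1.0, None, 2.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 0.0],
    [4.0, 8.0, -1.0],
    [5.0, None, -2.0],
]
completed = impute_rank_one(data)
print(completed[0][1], completed[4][1])
```

In this toy case the estimates should come out very close to the withheld values 2 and 10. That happens only because the data are exactly rank one and the observed column means happen to equal the true means. On real data a single component is rarely enough, and the number of components would be chosen by cross-validation as described above.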

pcaMethods—a bioconductor package providing PCA methods for incomplete data

2007 | Wolfram Stacklies¹, Henning Redestig², Matthias Scholz³, Dirk Walther² and Joachim Selbig²,⁴