pcaMethods—a bioconductor package providing PCA methods for incomplete data

Vol. 23 no. 9 2007, pages 1164–1167 | Wolfram Stacklies, Henning Redestig, Matthias Scholz, Dirk Walther and Joachim Selbig
The article introduces pcaMethods, a Bioconductor package for principal component analysis (PCA) on incomplete data sets. The package is particularly useful for microarray and metabolite data but can be applied to any incomplete data set. It provides robust PCA methods that tolerate missing values and can also estimate them, which makes it possible to apply statistical methods that are sensitive to missing values afterwards.

The package includes several algorithms, each with a different approach to handling missing data: probabilistic PCA (PPCA), Bayesian PCA (BPCA), inverse non-linear PCA (NLPCA), NIPALS PCA, SVDimpute, and LLSimpute. PPCA combines an expectation maximization (EM) approach with a probabilistic model. BPCA uses a Bayesian estimation method and is especially suited for missing value estimation. It is based on a variational Bayesian framework with automatic relevance determination, which leads to a different scaling of the principal components compared to standard PCA. NLPCA is suitable for non-linear data and uses an auto-associative neural network. NIPALS PCA is tolerant to small amounts of missing data. SVDimpute estimates missing values using eigengenes, and LLSimpute uses a linear combination of similar variables.

The article discusses the performance of these methods, noting that BPCA generally provides the best missing value estimation accuracy, while NLPCA may be superior for non-linear data. PCA is the fastest method and is recommended for large data sets. The package also includes parameter estimation methods and visualization tools. pcaMethods is available for R and is part of the Bioconductor suite. It is integrated into the web-based tool MetaGeneAlyse for analyzing large-scale transcript and metabolite data. The package is supported by the Max Planck Society.
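
To make the idea of a PCA that tolerates missing values concrete, the following is a minimal sketch of a NIPALS-style iteration that skips missing cells and then fills them from the low-rank reconstruction. It is written in plain Python for illustration only and is not the package's R implementation. The function names `nipals_pca` and `impute_missing`, the use of `None` for missing cells, and the choice of starting column are all assumptions made for this example. It also assumes that every row and every column contains at least one observed value.

```python
import math


def nipals_pca(X, n_components=1, max_iter=500, tol=1e-12):
    """NIPALS-style PCA on a list of rows; missing cells are None.

    Returns (scores, loadings, means), one score/loading vector per component.
    Illustrative sketch only; names and defaults are assumptions.
    """
    n, p = len(X), len(X[0])
    # Centre each column using its observed values only.
    means = []
    for j in range(p):
        observed = [row[j] for row in X if row[j] is not None]
        means.append(sum(observed) / len(observed))
    # Residual matrix; missing cells stay None and are skipped in every sum.
    R = [[None if v is None else v - means[j] for j, v in enumerate(row)]
         for row in X]

    scores, loadings = [], []
    for _component in range(n_components):
        # Start from the residual column with the largest observed sum of squares.
        col_ss = [sum(R[i][j] ** 2 for i in range(n) if R[i][j] is not None)
                  for j in range(p)]
        start = max(range(p), key=lambda j: col_ss[j])
        if col_ss[start] == 0.0:
            break  # nothing left to explain
        t = [R[i][start] if R[i][start] is not None else 0.0 for i in range(n)]
        load = [0.0] * p
        for _iteration in range(max_iter):
            # Loadings: regress each column on the scores over its observed rows.
            load = []
            for j in range(p):
                rows = [i for i in range(n) if R[i][j] is not None]
                denom = sum(t[i] ** 2 for i in rows)
                num = sum(R[i][j] * t[i] for i in rows)
                load.append(num / denom if denom else 0.0)
            norm = math.sqrt(sum(v * v for v in load))
            if norm == 0.0:
                break
            load = [v / norm for v in load]
            # Scores: regress each row on the loadings over its observed columns.
            t_new = []
            for i in range(n):
                cols = [j for j in range(p) if R[i][j] is not None]
                denom = sum(load[j] ** 2 for j in cols)
                num = sum(R[i][j] * load[j] for j in cols)
                t_new.append(num / denom if denom else 0.0)
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        scores.append(t)
        loadings.append(load)
        # Deflate: remove the fitted component from the observed cells.
        for i in range(n):
            for j in range(p):
                if R[i][j] is not None:
                    R[i][j] -= t[i] * load[j]
    return scores, loadings, means


def impute_missing(X, n_components=1):
    """Fill None cells with the PCA reconstruction; observed cells are kept."""
    scores, loadings, means = nipals_pca(X, n_components)
    completed = []
    for i, row in enumerate(X):
        new_row = []
        for j, v in enumerate(row):
            if v is None:
                v = means[j] + sum(t[i] * l[j] for t, l in zip(scores, loadings))
            new_row.append(v)
        completed.append(new_row)
    return completed
```

Calling `impute_missing(X, n_components=1)` on a small matrix with a few `None` cells returns a completed copy in which only the missing cells have changed. Because the column means are computed from observed values only, the estimates become less reliable as the share of missing data grows, which is consistent with the summary's remark that NIPALS PCA tolerates only small amounts of missing data.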