Principal component analysis for clustering gene expression data

Principal component analysis (PCA) is often applied to gene expression data before clustering in order to reduce dimensionality. This study shows that doing so does not always improve cluster quality and can sometimes degrade it. The authors compared the quality of clusters obtained from the original data with that of clusters obtained from data projected onto principal component axes, using both real and synthetic gene expression datasets. The results indicate that the first few principal components, which capture most of the variation in the data, do not necessarily capture most of the cluster structure.

The effect of PCA on clustering varied with the clustering algorithm and the similarity metric. The study also shows that there exist other sets of principal components that may yield better clustering results than the first few. It therefore recommends against using PCA before clustering except in special circumstances, and concludes that, without external criteria, PCA is not a reliable preprocessing step for clustering gene expression data. The results suggest that choosing the right clustering algorithm and considering the specific characteristics of the data matter more than using PCA for dimensionality reduction. The study also emphasizes the need for careful interpretation of cluster structure observed in reduced-dimensional PCA subspaces.
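
The claim that high-variance components need not carry the cluster structure can be illustrated with a small sketch. This is a contrived toy under stated assumptions, not the study's protocol: the synthetic matrix, the plain k-means routine, and the helper names (`pca_scores`, `kmeans`, `adjusted_rand_index`) are all hypothetical, and the adjusted Rand index is used only as one common external criterion, since the summary above does not name the measure. The data are built so that one feature has large variance but no cluster signal, while a lower-variance feature separates the two groups.

```python
import numpy as np

def pca_scores(X):
    """Return PC scores (columns ordered by variance) and explained-variance ratios."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2 / (len(X) - 1)
    return Xc @ Vt.T, var / var.sum()

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means with Euclidean distance (illustrative, not optimized)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a center in place if its cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings, from their contingency table."""
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(table, (ai, bi), 1)
    pairs = lambda x: x * (x - 1) / 2.0  # number of pairs among x items
    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    max_index = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical toy data: 200 samples in two known classes, three features.
data_rng = np.random.default_rng(0)
true_labels = np.repeat([0, 1], 100)
noise = data_rng.normal(0.0, 10.0, size=200)            # large variance, no class signal
signal = data_rng.normal(0.0, 0.5, size=200) + np.where(true_labels == 0, -2.0, 2.0)
minor = data_rng.normal(0.0, 0.5, size=200)             # small variance, no class signal
X = np.column_stack([noise, signal, minor])

scores, var_ratio = pca_scores(X)

# Cluster the original data, then PC1 alone, then PC2 alone; score each against the truth.
ari_original = adjusted_rand_index(true_labels, kmeans(X, 2))
ari_pc1 = adjusted_rand_index(true_labels, kmeans(scores[:, [0]], 2))
ari_pc2 = adjusted_rand_index(true_labels, kmeans(scores[:, [1]], 2))

print("explained variance ratio:", np.round(var_ratio, 3))
print("ARI, original data:", round(float(ari_original), 3))
print("ARI, PC1 only:", round(float(ari_pc1), 3))
print("ARI, PC2 only:", round(float(ari_pc2), 3))
```

In this setup the first component should absorb most of the variance while agreeing with the true classes no better than chance, and the second component should recover them almost perfectly. Real expression data are rarely this clean, so the sketch illustrates the mechanism only and says nothing about how often it occurs in practice.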


2001 | K. Y. Yeung* and W. L. Ruzzo