Fast R Functions for Robust Correlations and Hierarchical Clustering

March 2012, Volume 46, Issue 11 | Peter Langfelder, Steve Horvath
This paper presents fast functions for calculating Pearson and robust correlations, as well as hierarchical clustering, in R. The correlation functions are part of the updated R package WGCNA; the clustering function is provided by the flashClust package.

The Pearson correlation function is optimized for speed, especially for data with a small number of missing values. It uses fast matrix multiplication and parallel processing to achieve significant speed improvements. The biweight midcorrelation, a robust alternative to Pearson correlation, is implemented with similar efficiency. The hierarchical clustering function in flashClust achieves performance close to O(n²) in practice, much faster than the standard R function hclust, whose worst-case complexity is O(n³).

The paper also discusses the use of robust correlation in network analysis to reduce the impact of outliers. The biweight midcorrelation is shown to be more robust to outliers than Pearson correlation, and the paper provides examples of its use in gene expression data analysis.

Timing comparisons show that the fast functions significantly reduce computation time for large datasets. For example, the fast Pearson correlation function reduces the time needed to calculate correlations for over 23,000 probe sets from about 15 days to less than 9 hours. Similarly, the fast hierarchical clustering function reduces the time needed to cluster 20,000 variables from almost 4.6 hours to less than 1 minute.

Finally, the paper discusses the handling of missing data in the fast correlation functions. The functions allow a trade-off between speed and accuracy by using an approximate method for missing data: the user can specify the maximum proportion of missing data that may be handled approximately. The paper provides examples of how to use these functions and discusses their performance in different scenarios.
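To make the robustness claim concrete, the following is a minimal sketch of the biweight midcorrelation in plain Python. It is not the paper's optimized R implementation: it assumes the commonly used definition (values centered on the median, with weights computed from deviations scaled by nine times the raw median absolute deviation), and the names `pearson`, `bicor`, and the toy data are illustrative only.

```python
import math
import statistics


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return num / den


def _biweight_terms(x):
    """Median-centered values times biweight weights, scaled to unit length."""
    med = statistics.median(x)
    mad = statistics.median([abs(a - med) for a in x])  # raw (unscaled) MAD
    if mad == 0:
        raise ValueError("MAD is zero; the biweight weights are undefined")
    terms = []
    for a in x:
        u = (a - med) / (9.0 * mad)
        # Points far from the median (|u| >= 1) receive zero weight.
        w = (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0
        terms.append((a - med) * w)
    norm = math.sqrt(sum(t * t for t in terms))
    return [t / norm for t in terms]


def bicor(x, y):
    """Biweight midcorrelation (illustrative sketch, assumed definition)."""
    tx, ty = _biweight_terms(x), _biweight_terms(y)
    return sum(a * b for a, b in zip(tx, ty))


# Toy data: a perfect linear relationship, then the same data with one outlier.
x = [float(i) for i in range(1, 11)]
y_clean = [2.0 * a + 1.0 for a in x]
y_outlier = y_clean[:-1] + [-50.0]

print("clean:   pearson =", pearson(x, y_clean), " bicor =", bicor(x, y_clean))
print("outlier: pearson =", pearson(x, y_outlier), " bicor =", bicor(x, y_outlier))
```

On the toy data, the single outlier turns the Pearson correlation negative, while the biweight midcorrelation stays clearly positive because the outlying point receives zero weight.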
The paper concludes that the fast functions for Pearson correlation and biweight midcorrelation, as well as the fast hierarchical clustering function, provide significant computational advantages for large-scale data analysis in R. These functions are particularly useful for applications in genomics and bioinformatics, where large datasets are common.