Inferring Correlation Networks from Genomic Survey Data

September 20, 2012 | Jonathan Friedman, Eric J. Alm
Friedman and Alm (2012) address the problem of inferring correlations from genomic survey data, particularly 16S rRNA gene profiling, where compositional effects can make the results unreliable. Such data are compositional: they report relative fractions of genes or species, and because those fractions always sum to 1, an increase in one component forces a decrease in the others. Standard measures such as Pearson's correlation are therefore not valid for such data and can produce spurious correlations. The authors demonstrate that compositional effects are widespread and severe in real data sets from the Human Microbiome Project (HMP), where many correlations among taxa are artifactual and true correlations may appear with the opposite sign.

To overcome these issues, the authors develop SparCC (Sparse Correlations for Compositional data), a method for estimating correlations from compositional data. SparCC uses a log-ratio transformation to convert compositional data into a form that can be analyzed with standard correlation methods. The method assumes that the number of components (e.g., OTUs or genes) is large and that the true correlation network is sparse. SparCC estimates the linear Pearson correlations between the log-transformed components and provides a mapping from the transformed variables back to the original data.

The authors evaluate SparCC on simulated data and on real data from the HMP, showing that it outperforms standard correlation methods at inferring true correlations, even in datasets with low diversity. They also demonstrate that SparCC can identify phylogenetically structured correlations, which standard methods often miss.
The study highlights the importance of considering compositional effects in microbial ecology and provides a robust tool for analyzing genomic survey data.
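
To make the compositional effect and the log-ratio idea concrete, the following is a minimal sketch in Python with NumPy. It is not the published SparCC implementation, which this summary does not describe in detail. It is a simplified, single-pass approximation built on the identity Var[log(x_i / x_j)] = w_i^2 + w_j^2 - 2 * rho_ij * w_i * w_j, where w_i^2 is the variance of the log abundance of component i, combined with the assumption that correlations are sparse. All function names, the simulated data, and the parameter values are assumptions chosen for illustration.

```python
import numpy as np


def simulate_fractions(n_samples=1000, n_taxa=50, seed=0):
    """Simulate hypothetical log-normal abundances and close them to fractions.

    Assumed setup: taxon 0 is dominant, taxa 1 and 2 are truly correlated
    (rho = 0.8), and every other pair is independent.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_taxa)
    mean[0] = 5.0                      # dominant taxon (low-diversity community)
    cov = np.eye(n_taxa)
    cov[1, 2] = cov[2, 1] = 0.8        # the only true correlation
    abundances = np.exp(rng.multivariate_normal(mean, cov, size=n_samples))
    # Closure: only relative fractions are observed, so each row sums to 1.
    return abundances / abundances.sum(axis=1, keepdims=True)


def log_ratio_variances(fractions):
    """Return T with T[i, j] = Var(log(x_i / x_j)) across samples."""
    c = np.cov(np.log(fractions), rowvar=False)
    v = np.diag(c)
    return v[:, None] + v[None, :] - 2.0 * c


def basic_log_ratio_correlations(fractions):
    """Approximate correlations of log abundances under a sparsity assumption."""
    t = log_ratio_variances(fractions)
    d = t.shape[0]
    # If correlations are sparse, the correlation terms in the row sums are
    # negligible: sum_j T[i, j] ~ (d - 1) * w_i^2 + sum_{j != i} w_j^2.
    m = np.ones((d, d)) + (d - 2) * np.eye(d)
    basis_var = np.linalg.solve(m, t.sum(axis=1))
    basis_var = np.clip(basis_var, 1e-12, None)   # guard against negative estimates
    w = np.sqrt(basis_var)
    rho = (basis_var[:, None] + basis_var[None, :] - t) / (2.0 * np.outer(w, w))
    np.fill_diagonal(rho, 1.0)
    return np.clip(rho, -1.0, 1.0)


fractions = simulate_fractions()
naive = np.corrcoef(fractions, rowvar=False)       # Pearson on raw fractions
estimate = basic_log_ratio_correlations(fractions)

print("True pair (1, 2), naive Pearson:     ", round(naive[1, 2], 2))
print("True pair (1, 2), log-ratio estimate:", round(estimate[1, 2], 2))
print("Mean naive Pearson, dominant taxon vs. others:", round(naive[0, 1:].mean(), 2))
print("Mean log-ratio estimate, dominant vs. independent taxa:",
      round(estimate[0, 3:].mean(), 2))
```

In this simulation the dominant taxon is independent of every other taxon, yet Pearson correlations computed on the fractions are negative on average, because the fractions must sum to 1. The log-ratio estimate does not depend on the normalization, since ratios of fractions equal ratios of the underlying abundances. It should therefore stay near zero for independent pairs and recover the one true correlation approximately.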