[slides] Inferring Correlation Networks from Genomic Survey Data

This study addresses the challenges of inferring correlation networks from genomic survey data (GSD), which are typically produced by high-throughput sequencing techniques such as 16S rRNA gene profiling. The authors point out that GSD are compositional: they report relative rather than absolute abundances, so the component fractions in each sample are constrained to sum to one. Because of this constraint, standard measures such as Pearson correlation are not valid for GSD. An increase in one component's fraction forces the others down, which creates artificial correlations, especially in low-diversity samples. The study demonstrates that these compositional effects can produce spurious correlations and can even make true correlations appear with the opposite sign. Community diversity is identified as a key factor determining how severe the effects are. The effects are not limited to Pearson correlations; they also affect non-parametric measures such as Spearman correlations. To address the problem, the authors introduce SparCC, a method that estimates correlations between log-transformed components of compositional data under the assumption that the correlation network is sparse. In simulations, SparCC is shown to be highly accurate even on challenging data sets and to outperform standard methods in correlation inference accuracy. Applied to Human Microbiome Project (HMP) data, the analysis shows that standard methods yield many spurious interactions and miss a significant number of true ones, whereas SparCC provides a more reliable inference, particularly in low-diversity samples. Together, the simulations and the HMP analysis support SparCC's robustness for inferring ecological networks from GSD and highlight the importance of using methods designed for compositional data to avoid misleading biological interpretations.
The study concludes that SparCC represents a significant advancement in the analysis of microbial community data, offering a more accurate and reliable approach to inferring correlations from genomic survey data.
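
The compositional effect described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it draws statistically independent log-normal abundances, normalizes them to fractions, and compares Pearson correlations of the fractions with a single-pass estimate built from log-ratio variances under a sparsity assumption. The function names, the simulation parameters, and the single-pass simplification are assumptions made for illustration; the sketch omits the iterative refinement and the handling of count data and zeros that a real analysis would need, and it requires at least three components and strictly positive fractions.

```python
import numpy as np


def simulate_independent_basis(n_samples, log_means, rng, log_sd=1.0):
    """Draw independent log-normal absolute abundances (samples x components).

    Hypothetical helper: the true correlation between any two components is zero.
    """
    log_means = np.asarray(log_means, dtype=float)
    log_x = rng.normal(log_means, log_sd, size=(n_samples, log_means.size))
    return np.exp(log_x)


def closure(abundances):
    """Normalize each sample so that its components sum to one."""
    return abundances / abundances.sum(axis=1, keepdims=True)


def basic_log_ratio_correlations(fractions):
    """Single-pass correlation estimate from strictly positive fractions.

    Relies on two facts:
      * t[i, j] = Var[log(x_i / x_j)] is unchanged by normalization, and
        t[i, j] = w_i^2 + w_j^2 - 2 * rho_ij * w_i * w_j,
        where w_i^2 is the variance of the log-abundance of component i.
      * If correlations are sparse, the rho terms roughly cancel in row sums:
        sum_j t[i, j] ~= (d - 2) * w_i^2 + sum_k w_k^2.
    """
    log_f = np.log(fractions)
    d = log_f.shape[1]
    if d < 3:
        raise ValueError("at least three components are required")

    # Variance of every pairwise log-ratio across samples (d x d, zero diagonal).
    diffs = log_f[:, :, None] - log_f[:, None, :]
    t = diffs.var(axis=0, ddof=1)

    # Solve the linear system for the log-abundance variances w_i^2.
    m = (d - 2) * np.eye(d) + np.ones((d, d))
    basis_var = np.linalg.solve(m, t.sum(axis=1))
    basis_var = np.clip(basis_var, 1e-12, None)  # guard against tiny negatives

    # Invert the log-ratio variance identity for rho_ij.
    w = np.sqrt(basis_var)
    rho = (basis_var[:, None] + basis_var[None, :] - t) / (2.0 * np.outer(w, w))
    return np.clip(rho, -1.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Low-diversity community: one component dominates the other four.
    abundances = simulate_independent_basis(5000, [2.0, 0.0, 0.0, 0.0, 0.0], rng)
    fractions = closure(abundances)

    np.set_printoptions(precision=2, suppress=True)
    print("Pearson correlations of fractions:")
    print(np.corrcoef(fractions, rowvar=False))
    print("Log-ratio based estimate:")
    print(basic_log_ratio_correlations(fractions))
```

Because the fractions in every sample sum to one, each component's covariances with the others must sum to the negative of its own variance, so the Pearson matrix of the fractions always contains negative entries even though the simulated abundances are independent. The log-ratio based estimate should stay close to zero off the diagonal in this setting, which is the behavior the summary attributes to SparCC.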

Inferring Correlation Networks from Genomic Survey Data

September 20, 2012 | Jonathan Friedman, Eric J. Alm