[slides] Robust methods for differential abundance analysis in marker gene surveys

This supplementary note describes the development and application of a zero-inflated Gaussian (ZIG) mixture model for differential abundance analysis of marker gene survey data. The model is motivated by the observed relationship between sequencing depth and the number of detected operational taxonomic units (OTUs). It handles count data from two populations, each with multiple samples and features (OTUs), and accounts for zero inflation through a mixture of a point mass at zero and a Gaussian distribution. Parameters, including fold-change estimates and their standard errors, are estimated with the Expectation-Maximization (EM) algorithm and then used to construct moderated t-statistics for differential abundance testing. The note also compares several differential abundance detection methods (Metastats, LEfSe, DESeq, and edgeR) on oral microbiome data from the Human Microbiome Project. The methods are evaluated on their detection of differentially abundant OTUs, their fold-change estimates, and their dispersion estimates. metagenomeSeq and DESeq show high agreement in fold-change estimates, while edgeR and LEfSe have different strengths and weaknesses. The note highlights the importance of controlling for confounding factors and the effect of sequencing depth on the detection of sparse features. It also addresses ambiguous assignment of reads to OTUs, which can occur in both RNA-seq data and marker gene surveys. The authors discuss the implications of this ambiguity and present a way to mitigate it using a non-overlapping clustering approach.
The note concludes with a discussion of the relevance of rarefaction and sparsity in metagenomic data and the potential for integrating the proposed methods into machine learning models for predictive analysis.
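
To make the mixture idea concrete, the following is a minimal Python sketch of an EM fit for a single feature. It is not the authors' implementation, and it simplifies the model summarized above: the mixing proportion is assumed constant rather than depending on sequencing depth, no normalization term is included, and the moderated t-statistic step is omitted. The function name `fit_zig_feature`, its parameters, and the simulated data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_zig_feature(y, group, n_iter=200, tol=1e-8):
    """EM for one feature under a simplified zero-inflated Gaussian model.

    y     : log2(count + 1) values, one per sample (zeros mark undetected features)
    group : 0/1 population indicator, one per sample
    Assumes both populations contain at least one sample.
    """
    y = np.asarray(y, dtype=float)
    group = np.asarray(group, dtype=float)
    X = np.column_stack([np.ones_like(y), group])  # intercept + group effect
    is_zero = (y == 0)
    # z[j] = posterior probability that sample j comes from the point mass at zero.
    # Only observed zeros can belong to that component.
    z = np.where(is_zero, 0.5, 0.0)

    for _ in range(n_iter):
        # M-step: weighted least squares with weights 1 - z
        w = 1.0 - z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
        resid = y - X @ beta
        sigma2 = max(float((w * resid**2).sum() / w.sum()), 1e-8)
        pi = float(np.clip(z.mean(), 1e-6, 1.0 - 1e-6))

        # E-step: update responsibilities for the observed zeros
        dens = np.exp(-resid**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
        z_new = np.where(is_zero, pi / (pi + (1.0 - pi) * dens), 0.0)

        converged = np.max(np.abs(z_new - z)) < tol
        z = z_new
        if converged:
            break

    return {"intercept": float(beta[0]), "log_fc": float(beta[1]),
            "sigma2": sigma2, "pi": pi}

# Simulated example (assumed values): log fold-change 2, 30% excess zeros
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 200)
y = rng.normal(5.0 + 2.0 * group, 1.0)
y[rng.random(y.size) < 0.3] = 0.0
print(fit_zig_feature(y, group))
```

In this sketch the weights `1 - z` down-weight zeros that are likely to be technical rather than biological, so the fold-change estimate is driven mainly by samples in which the feature was actually detected.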

Robust methods for differential abundance analysis of marker gene survey data

September 4, 2013 | Joseph Nathaniel Paulson, O. Colin Stine, Héctor Corrada Bravo, & Mihai Pop