Microbiome Datasets Are Compositional: And This Is Not Optional

15 November 2017 | Gregory B. Gloor, Jean M. Macklaim, Vera Pawlowsky-Glahn, Juan J. Egozcue
Microbiome datasets are compositional, and this is not optional. High-throughput sequencing (HTS) data, such as 16S rRNA gene amplimers, metagenomes, or metatranscriptomes, are commonly used to study human disease, ecological differences, and the built environment. These datasets are compositional because the total number of reads is constrained by the capacity of the sequencing instrument, so the counts carry only relative information: they reflect proportions, not absolute abundances. Many researchers are either unaware of this or assume specific properties of compositional data, which leads to misinterpretation of results. This review highlights the dangers of ignoring the compositional nature of microbiome data and emphasizes that HTS datasets should be treated as compositions at every stage of analysis.

Compositional data sum to an arbitrary constant, and their analysis requires methods designed for that constraint. Traditional approaches, such as rarefaction or count normalization, can introduce biases and inaccuracies. For example, the total number of reads can confound distance or dissimilarity measures, and many common distance metrics do not account for the compositional nature of the data. Correlation is also unreliable in compositional data, both because of negative correlation bias and because correlations can change when the data are subset.

To address these issues, compositional data analysis (CoDa) methods are recommended. These include log-ratio transformations, such as the centered log-ratio (clr) transformation, which expresses each feature relative to the geometric mean of its sample and makes the data symmetric and linearly related for statistical analysis. Compositional replacements for distance metrics, such as the Aitchison distance and the phylogenetic transform, are also used. Additionally, compositional principal component analysis (PCA) biplots provide a more stable and interpretable representation of microbial community data. Differential abundance analysis should also account for the compositional nature of the data; tools such as ANCOM and ALDEx2 are designed for this purpose and reduce false positive rates.
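The unreliability of correlation in compositional data can be illustrated with a small sketch. The abundances below are hypothetical values chosen for illustration, not data from the review, and the helper names `close` and `pearson` are assumptions of this example. Two taxa that rise together in absolute terms stay positively correlated in the full three-taxon composition, yet become perfectly negatively correlated once the third taxon is dropped and the proportions are recomputed.

```python
import math

def close(sample):
    """Rescale one sample so its parts sum to 1 (the closure operation)."""
    total = sum(sample)
    return [part / total for part in sample]

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical absolute abundances of three taxa across four samples.
taxon_a = [10, 20, 30, 40]
taxon_b = [10, 30, 60, 100]
taxon_c = [1000, 500, 200, 50]

# Proportions when all three taxa are observed, and when taxon C is dropped.
full = [close(s) for s in zip(taxon_a, taxon_b, taxon_c)]
sub = [close(s) for s in zip(taxon_a, taxon_b)]

r_absolute = pearson(taxon_a, taxon_b)
r_full = pearson([s[0] for s in full], [s[1] for s in full])
r_subset = pearson([s[0] for s in sub], [s[1] for s in sub])

print(f"absolute abundances:   r = {r_absolute:+.3f}")
print(f"full composition:      r = {r_full:+.3f}")
print(f"two-taxon subset:      r = {r_subset:+.3f}")
```

With only two parts, each proportion is one minus the other, so the subset correlation is forced to -1 regardless of the underlying biology. This is the extreme case of the negative correlation bias described above.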
Overall, the analysis of microbiome datasets should incorporate compositional data analysis to ensure accurate and reliable results.
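As a minimal sketch of the clr transformation and the Aitchison distance named in the summary, the following example uses only the Python standard library. The function names and the read counts are assumptions made for illustration. Zeros have no logarithm and must be replaced before the transformation; this sketch simply rejects them instead of choosing a replacement strategy.

```python
import math

def clr(parts):
    """Centered log-ratio: log of each part minus the mean log of the sample."""
    if any(p <= 0 for p in parts):
        raise ValueError("clr needs strictly positive parts; replace zeros first")
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [value - mean_log for value in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr vectors of two samples."""
    return math.dist(clr(x), clr(y))

# Hypothetical read counts for three taxa in two samples.
sample_a = [120, 30, 50]
sample_b = [40, 90, 70]
deeper_b = [10 * count for count in sample_b]  # same proportions, 10x the reads

d_shallow = aitchison_distance(sample_a, sample_b)
d_deep = aitchison_distance(sample_a, deeper_b)
euclid_shallow = math.dist(sample_a, sample_b)
euclid_deep = math.dist(sample_a, deeper_b)

print(f"Aitchison: {d_shallow:.4f} vs {d_deep:.4f}")
print(f"Euclidean on raw counts: {euclid_shallow:.1f} vs {euclid_deep:.1f}")
```

Multiplying a sample by a constant shifts every log by the same amount, which the centering removes, so the Aitchison distance is unchanged by sequencing depth. The Euclidean distance on raw counts grows with depth instead, which is one way total read count can confound a distance measure.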