Genome sequence-based species delimitation with confidence intervals and improved distance functions

2013 | Jan P Meier-Kolthoff, Alexander F Auch, Hans-Peter Klenk and Markus Göker
This article presents a methodology for delimiting prokaryotic species from genome sequence data, as a replacement for traditional DNA-DNA hybridization (DDH). The Genome Blast Distance Phylogeny (GBDP) approach calculates genome-to-genome distances using BLAST and other alignment tools, yielding a digital, reliable estimate of genomic relatedness. It was developed as a computational alternative to DDH, with the aim of improving the accuracy and consistency of species classification. The study benchmarks GBDP against DDH values and finds it the most accurate of the methods tested.

The article introduces new features, including confidence intervals for intergenomic distances and improved distance functions. GBDP is also enhanced with a "coverage" algorithm, which improves the handling of overlapping high-scoring segment pairs (HSPs) and yields more accurate distance estimates. The study explores several distance formulae and their effect on the correlation with DDH values. The best-performing settings were determined through extensive testing, involving 4350 distinct configurations and 136 million genome comparisons. GBDP outperforms the other methods in predicting DDH values, particularly when the "d6" formula is combined with BLAST+ and optimized parameters.

Confidence intervals were calculated by bootstrapping and jackknifing, which gives a statistical basis for evaluating in silico DDH predictions and makes species delimitation more reliable by accounting for uncertainty in the data. Confidence intervals derived from statistical models proved more stable than those from resampling, which underlines the importance of statistical evaluation for in silico DDH replacements. The study also evaluates statistical models for DDH prediction and finds that generalized linear models (GLMs) give more accurate and stable results than linear models.
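
The summary does not spell out GBDP's algorithms, so the following Python sketch is only an illustration of two ideas mentioned above, not the authors' implementation. The first part shows one common way to avoid double-counting overlapping HSPs, namely taking the union of their intervals. The second part shows a percentile bootstrap interval for a distance, assuming that HSPs are the resampling unit (the summary does not say what GBDP resamples). The names `merge_intervals`, `covered_length`, `identity_distance` and `bootstrap_interval`, the toy distance itself, and all numbers are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Interval = Tuple[int, int]  # half-open [start, end) positions on one genome
Hsp = Tuple[int, int]       # (alignment length, identical positions)


def merge_intervals(intervals: Sequence[Interval]) -> List[Interval]:
    """Union of possibly overlapping intervals as a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_length(intervals: Sequence[Interval]) -> int:
    """Number of positions covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


def identity_distance(hsps: Sequence[Hsp]) -> float:
    """Toy distance (an assumption): 1 - identical positions / aligned positions."""
    total_length = sum(length for length, _ in hsps)
    total_identities = sum(identities for _, identities in hsps)
    return 1.0 - total_identities / total_length


def bootstrap_interval(
    hsps: Sequence[Hsp],
    replicates: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for identity_distance, resampling HSPs."""
    rng = random.Random(seed)  # fixed seed so repeated calls agree
    values = sorted(
        identity_distance([rng.choice(hsps) for _ in hsps])
        for _ in range(replicates)
    )
    lower = values[int((alpha / 2) * (replicates - 1))]
    upper = values[int((1 - alpha / 2) * (replicates - 1))]
    return lower, upper


if __name__ == "__main__":
    # Invented HSP positions: the first two overlap and are merged.
    print(merge_intervals([(0, 100), (50, 150), (300, 400)]))
    # Invented (length, identities) pairs for five HSPs.
    toy_hsps = [(1000, 950), (800, 700), (1200, 1100), (500, 480), (900, 800)]
    print(identity_distance(toy_hsps), bootstrap_interval(toy_hsps))
```

A jackknife version would follow the same pattern, leaving out part of the HSPs in each replicate instead of resampling with replacement.
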
GLMs account for the non-linear relationship between genome distances and DDH values, which improves both predictions and error ratios. Log-transformed variables and quasi-binomial error families further improve model performance.

Overall, GBDP represents a significant advance in genome-based species delimitation and a reliable, accurate alternative to traditional DDH. The confidence intervals and improved statistical models put in silico DDH predictions on a statistically sound footing, supporting a consistent and truly genome sequence-based classification of microorganisms. The web service at http://ggdc.dsmz.de provides access to these tools for use in microbial taxonomy.
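
To make the GLM idea concrete, here is a minimal Python sketch of a logit-link GLM that predicts DDH (as a proportion between 0 and 1) from a log-transformed distance, fitted by iteratively reweighted least squares. It is an illustration under assumptions, not the model published by the authors: the choice of a single log-distance predictor, the function names `fit_logit_glm`, `predict_ddh` and `dispersion`, and the toy data are all hypothetical. In a quasi-binomial model the coefficients are estimated as in a binomial model, and a separate dispersion parameter is estimated from the Pearson residuals.

```python
import math
from typing import Sequence, Tuple


def sigmoid(eta: float) -> float:
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logit_glm(
    distances: Sequence[float], ddh: Sequence[float], iterations: int = 25
) -> Tuple[float, float]:
    """Fit ddh ~ sigmoid(b0 + b1 * log(distance)) by IRLS; returns (b0, b1)."""
    x = [math.log(d) for d in distances]  # log-transformed predictor (assumption)
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi in zip(x, ddh):
            eta = b0 + b1 * xi
            mu = sigmoid(eta)
            w = mu * (1.0 - mu)       # binomial variance function, used as weight
            z = eta + (yi - mu) / w   # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        # Solve the 2x2 weighted least-squares normal equations.
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1


def predict_ddh(distance: float, b0: float, b1: float) -> float:
    """Predicted DDH as a proportion in (0, 1)."""
    return sigmoid(b0 + b1 * math.log(distance))


def dispersion(
    distances: Sequence[float], ddh: Sequence[float], b0: float, b1: float
) -> float:
    """Quasi-binomial dispersion: Pearson chi-square / (n - 2 parameters)."""
    pearson = 0.0
    for d, y in zip(distances, ddh):
        mu = predict_ddh(d, b0, b1)
        pearson += (y - mu) ** 2 / (mu * (1.0 - mu))
    return pearson / (len(ddh) - 2)


if __name__ == "__main__":
    # Invented toy data, not measurements from the study.
    toy_distances = [0.01, 0.02, 0.05, 0.10, 0.20, 0.30]
    toy_ddh = [0.96, 0.87, 0.70, 0.44, 0.25, 0.14]
    b0, b1 = fit_logit_glm(toy_distances, toy_ddh)
    print("coefficients:", b0, b1)
    print("dispersion:", dispersion(toy_distances, toy_ddh, b0, b1))
    print("predicted DDH at distance 0.08:", predict_ddh(0.08, b0, b1))
```

Unlike a straight line, the fitted curve stays between 0 and 1 at any distance, which is one reason a GLM of this kind suits a bounded quantity such as DDH.
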