March 11, 2024 | Martin Hunt, Leandro Lima, Wei Shen, John Lees, Zamin Iqbal
The AllTheBacteria project provides a comprehensive, uniformly assembled, and searchable database of bacterial genomes. The dataset contains 1,932,812 genome assemblies, combining new data from 2018 to 2023 with the previously released 661,405 genomes. The release includes quality control results and taxonomic abundance estimates based on the GTDB phylogeny. The genomes are compressed using an evolution-informed approach, giving a total size of 102Gb in xz archives, and multiple search indexes are provided for efficient querying. The project aims to improve on the previous dataset by involving the research community in the annotation and analysis of specific bacterial species. The data is produced by an open-source pipeline and is accessible via a public repository. Future releases are planned, with additional annotations and indexes. The dataset supports the study of bacterial evolution and diversity, of bacteria's impact on global ecology, and of bacterial genomes and their functional elements. The project emphasizes collaboration and community involvement to make the data more useful and accessible across research fields.
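
Since the assemblies are distributed in xz archives, a reader may want to scan one without unpacking it to disk. The following is a minimal sketch, not the project's own tooling. It assumes that an archive is a tar file compressed with xz and that each member is a FASTA file; neither detail is stated in the summary above. The function names `iter_fasta_records` and `assembly_lengths` are hypothetical and introduced only for illustration.

```python
import io
import tarfile
from typing import BinaryIO, Dict, Iterator, TextIO, Tuple


def iter_fasta_records(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def assembly_lengths(archive: BinaryIO) -> Dict[str, int]:
    """Map each regular-file member of a tar.xz archive to its total bases.

    Assumption: every regular file in the archive is a FASTA assembly.
    """
    totals = {}
    with tarfile.open(fileobj=archive, mode="r:xz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            raw = tar.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="ascii")
            totals[member.name] = sum(
                len(seq) for _, seq in iter_fasta_records(text)
            )
    return totals
```

With a local file, this could be called as `assembly_lengths(open(path, "rb"))`, where `path` points at one downloaded archive. Members are read one at a time, so only a single assembly is held in memory at once.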