March 11, 2024 | Martin Hunt, Leandro Lima, Wei Shen, John Lees, Zamin Iqbal
The paper introduces the AllTheBacteria project, which aims to create a comprehensive and accessible dataset of bacterial genomes. The project extends the 661,405 genomes assembled by Blackwell et al. in 2021 to cover a further 4.5 years of data, up to May 2023, roughly tripling the number of genomes to 1,932,812. The new assemblies are uniformly processed, with quality control and taxonomic abundance estimates based on the GTDB phylogeny. The data is compressed using an evolution-informed approach, reducing its size to 102GB, and multiple search indexes are provided for easy access. The paper also outlines plans for future annotations and community contributions that would make the dataset more useful for research areas such as gene annotation, pangenome construction, and mobile element analysis. The initial release (v0.1) documents the methodology, the software pipelines, and how the community can get involved; future releases are expected to add more features and analyses.
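As a quick sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The genome counts and the 102GB figure come from the summary; the variable names are illustrative. Two points are assumptions rather than statements from the paper: that "GB" means 10^9 bytes, and that the compressed archive covers all 1,932,812 assemblies.

```python
# Figures taken from the summary above.
PREVIOUS_GENOMES = 661_405     # Blackwell et al. (2021)
CURRENT_GENOMES = 1_932_812    # AllTheBacteria, up to May 2023

# Assumption: "102GB" is decimal gigabytes (10**9 bytes) and
# covers every assembly in the release.
COMPRESSED_BYTES = 102 * 10**9

added = CURRENT_GENOMES - PREVIOUS_GENOMES
growth = CURRENT_GENOMES / PREVIOUS_GENOMES
bytes_per_genome = COMPRESSED_BYTES / CURRENT_GENOMES

print(f"Genomes added:         {added:,}")
print(f"Growth factor:         {growth:.2f}x")
print(f"Compressed per genome: {bytes_per_genome / 1000:.1f} kB")
```

Under those assumptions the growth factor comes out just under 3, consistent with "tripling", and the compressed archive averages a few tens of kilobytes per genome.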