April 2021 | Benjamin Buchfink, Klaus Reuter and Hajk-Georg Drost
The paper introduces an improved version of DIAMOND, a protein alignment tool, that substantially increases performance and sensitivity for tree-of-life scale protein alignments. The new version completes large-scale protein alignments in hours while matching the sensitivity of the gold standard, BLASTP. The gains come from optimized algorithms, distributed computing, and double indexing, which together allow efficient handling of massive datasets. DIAMOND now offers two sensitivity modes, --very-sensitive and --ultra-sensitive, which enable data-intensive comparative genomics research, such as tree-of-life scale tracing of protein evolution, gene age inference, and functional annotation of genes and gene families, with the same accuracy as BLAST but an 80–360-fold computational speedup. The new version is available as open-source software under the GPL3 license.
The paper compares DIAMOND against BLASTP and MMseqs2 on a benchmark dataset of 1.7 million protein sequences. DIAMOND is 12–15 times faster than MMseqs2 at similar sensitivity, and 6–8 times faster than older versions of DIAMOND. Relative to BLASTP, the speedup ranges from 8,000-fold in the least sensitive mode to 80-fold in the ultra-sensitive mode.
The paper also describes the algorithmic and computational improvements behind these results: double indexing, Hamming distance filtering, ungapped extension, a leftmost seed filter, adaptive ranking, a gapped extension filter, chaining, and banded SWIPE. With these improvements, DIAMOND can align 281 million sequences from the NCBI nr database against the UniRef50 database in under 18 hours.
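
Of the filtering stages listed above, Hamming distance filtering is the simplest to illustrate. The following is a minimal sketch of the general idea only, not DIAMOND's implementation: before any expensive extension, the residues in a fixed window around a seed hit are compared position by position, and the hit is discarded if too many positions differ. The function names, the window size of 16, and the threshold of 8 mismatches are all hypothetical values chosen for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x != y for x, y in zip(a, b))


def passes_hamming_filter(
    query: str,
    subject: str,
    q_pos: int,
    s_pos: int,
    window: int = 16,         # hypothetical window size
    max_mismatches: int = 8,  # hypothetical threshold
) -> bool:
    """Cheap pre-filter for a seed hit at query[q_pos] / subject[s_pos].

    Compares an ungapped window around the hit and keeps the hit only if
    the number of mismatching positions is at most max_mismatches.
    """
    # Extend left and right by the same amount in both sequences,
    # clipping at the sequence ends so both windows have equal length.
    left = min(window // 2, q_pos, s_pos)
    right = min(window - left, len(query) - q_pos, len(subject) - s_pos)
    q_win = query[q_pos - left : q_pos + right]
    s_win = subject[s_pos - left : s_pos + right]
    return hamming_distance(q_win, s_win) <= max_mismatches


if __name__ == "__main__":
    query = "MKTAYIAKQRQISFVK"
    similar = "MKTAYLAKQRQISFVK"    # one substitution
    unrelated = "GGGGGGGGGGGGGGGG"  # no shared residues
    print(passes_hamming_filter(query, similar, 8, 8))
    print(passes_hamming_filter(query, unrelated, 8, 8))
```

A filter of this kind is attractive because it needs no dynamic programming: most spurious seed hits can be rejected with a handful of comparisons, leaving the costlier extension stages for the few hits that survive.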
The paper concludes that DIAMOND is a powerful tool for sensitive tree-of-life scale protein alignments, with the potential to support the Earth BioGenome Project and other large-scale sequencing initiatives.
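
To make the two sensitivity modes concrete, the sketch below assembles a `diamond blastp` command line as a Python argument list. It is an illustration under stated assumptions: the helper name `build_diamond_command` and the file names are hypothetical, and the ultra-sensitive mode is written with the flag spelling used on the DIAMOND command line, `--ultra-sensitive`. The function only builds the command; it does not run it.

```python
from typing import List, Optional

# Modes discussed in the summary, mapped to command-line flags.
# "default" adds no sensitivity flag.
SENSITIVITY_FLAGS = {
    "default": None,
    "very-sensitive": "--very-sensitive",
    "ultra-sensitive": "--ultra-sensitive",
}


def build_diamond_command(
    query: str,
    database: str,
    output: str,
    mode: str = "default",
    threads: Optional[int] = None,
) -> List[str]:
    """Return the argument list for a DIAMOND protein search (hypothetical helper)."""
    if mode not in SENSITIVITY_FLAGS:
        raise ValueError(f"unknown sensitivity mode: {mode!r}")
    cmd = ["diamond", "blastp", "--query", query, "--db", database, "--out", output]
    flag = SENSITIVITY_FLAGS[mode]
    if flag is not None:
        cmd.append(flag)
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # File names are placeholders for illustration.
    cmd = build_diamond_command(
        "queries.faa", "reference.dmnd", "hits.tsv", mode="very-sensitive", threads=8
    )
    print(" ".join(cmd))
```

With DIAMOND installed and a database built by `diamond makedb`, the returned list could be passed to `subprocess.run(cmd, check=True)`. Keeping the mode as a parameter makes it straightforward to trade speed against sensitivity for a given dataset.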