MMseqs2: sensitive protein sequence searching for the analysis of massive data sets

June 7, 2017 | Martin Steinegger & Johannes Söding
The article introduces MMseqs2, open-source software for sensitive protein sequence searching in large metagenomic data sets. Falling sequencing costs mean that metagenomic projects now produce terabytes of sequences, so computational costs dominate, with protein searches consuming over 90% of computing resources. Traditional tools like BLAST are too slow at this scale, while faster tools like DIAMOND give up sensitivity for speed. MMseqs2 improves on both, achieving sensitivity better than PSI-BLAST at more than 400 times its speed.
MMseqs2 searches in three stages: a short-word ("k-mer") match, a vectorized ungapped alignment, and a gapped alignment. The k-mer match stage is crucial for the improved performance: it detects consecutive similar k-mer matches on the same diagonal, which allows MMseqs2 to use larger k-mers without losing sensitivity. The software is parallelized to scale well across multiple cores and servers, and it requires minimal random memory access, which keeps it efficient on large databases.
The authors benchmarked MMseqs2 on full-length sequences containing disordered, low-complexity, and repeat regions, where it achieved high sensitivity and accuracy. MMseqs2 outperforms other tools such as DIAMOND and PSI-BLAST in both sensitivity and speed. In practical applications, MMseqs2 was used to annotate proteins in the Ocean Microbiome Reference Gene Catalog (OM-RGC), annotating 78% of sequences with eggNOG domains in 1.5% of the time required by BLAST. It was also used to annotate hypothetical proteins with Pfam domains, achieving 474 annotations with E-values below 0.001 in 8.3 hours, compared to 514 annotations with HMMER3 in 10.6 seconds.
Overall, MMseqs2 closes the performance gap between sequencing and computational analysis, offering significant gains in speed and sensitivity, which should enable new possibilities for analyzing large data sets.
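
To make the k-mer match stage described in the summary more concrete, here is a minimal Python sketch of the "two consecutive matches on the same diagonal" idea. It is an illustration under stated assumptions, not the MMseqs2 implementation. It counts only exact k-mer matches, whereas the summary describes similar k-mer matches, and it compares one query against one target. The function names `build_kmer_index` and `double_match_diagonals` are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(target: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in the target to the positions where it starts."""
    index = defaultdict(list)
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    return index


def double_match_diagonals(query: str, target: str, k: int = 3) -> set[int]:
    """Return diagonals (i - j) hit by two consecutive k-mer matches.

    Simplification (assumption): only exact k-mer matches are counted,
    and a single "previous diagonal" is remembered for the one target.
    """
    index = build_kmer_index(target, k)
    last_diagonal = None  # diagonal of the most recent k-mer match
    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonal = i - j
            if diagonal == last_diagonal:
                # Second match in a row on the same diagonal.
                hits.add(diagonal)
            last_diagonal = diagonal
    return hits
```

For example, `double_match_diagonals("ACDEFGHIK", "WWACDEFGHIK")` returns `{-2}`, because the query aligns to the target with an offset of two residues. A pair that shares only one isolated k-mer, such as `"ACDWWW"` and `"ACDYYY"`, yields an empty set. This hints at why requiring a second match on the same diagonal can filter out chance hits while still finding related sequences.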