MMseqs2: sensitive protein sequence searching for the analysis of massive data sets

June 7, 2017 | Martin Steinegger & Johannes Söding
MMseqs2 is open-source software for sensitive protein sequence searching, designed for the analysis of massive data sets. It improves on existing search tools by achieving sensitivity better than PSI-BLAST at more than 400 times its speed. A search runs in three stages: a k-mer match stage, a vectorized ungapped alignment, and a gapped Smith-Waterman alignment. The k-mer match stage detects consecutive k-mer matches on the same diagonal between query and target, which allows higher sensitivity without sacrificing speed.
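The diagonal idea behind the k-mer match stage can be illustrated with a small sketch. This is a simplified illustration, not the MMseqs2 implementation: it assumes exact k-mer matches, a single target sequence, and a plain count of matches per diagonal. The function names `kmer_index` and `double_diagonal_hits` are hypothetical.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map every k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for j in range(len(seq) - k + 1):
        index[seq[j:j + k]].append(j)
    return index


def double_diagonal_hits(query, target, k=3):
    """Return the diagonals (i - j) that carry at least two k-mer matches.

    Simplifying assumption: exact k-mer matches only, and any two matches
    on the same diagonal count as a double match.
    """
    index = kmer_index(target, k)
    matches_per_diagonal = defaultdict(int)
    for i in range(len(query) - k + 1):
        # Every target position j sharing this k-mer lies on diagonal i - j.
        for j in index.get(query[i:i + k], ()):
            matches_per_diagonal[i - j] += 1
    return sorted(d for d, n in matches_per_diagonal.items() if n >= 2)


# The query occurs in the target shifted by two residues, so all of its
# k-mer matches fall on diagonal -2.
print(double_diagonal_hits("MKTAYIAKQR", "GGMKTAYIAKQRGG"))  # → [-2]
```

Requiring two matches on one diagonal filters out isolated chance matches: a target that shares only a single k-mer with the query yields no hit and never reaches the more expensive alignment stages.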
MMseqs2 is parallelized on three levels: vectorized instructions within a core, multiple cores within a server, and multiple servers. This enables efficient processing of large databases. It requires significant memory, but large databases can be handled by splitting them across servers. On benchmark data sets, MMseqs2 compared favorably in sensitivity and speed with other tools such as BLAST, DIAMOND, and HMMER3. It copes well with disordered and repeat regions, which are challenging for traditional methods. In the annotation of large metagenomic data sets, MMseqs2 reduces false positives and increases the fraction of annotated sequences. It is also effective in profile searches, outperforming PSI-BLAST in both speed and sensitivity. Applications include functional annotation of hypothetical proteins and clustering of sequence data sets. By balancing speed and sensitivity, MMseqs2 addresses the computational bottleneck in metagenomic analysis and is a valuable tool for analyzing large-scale protein sequence data.
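Splitting a database across servers can be sketched as follows. This is a minimal illustration under stated assumptions, not how MMseqs2 is implemented: the chunks are searched one after another in a single process instead of on separate servers, the similarity score is a toy count of shared k-mers, and all function names are hypothetical.

```python
def split_into_chunks(targets, n_chunks):
    """Round-robin split of (name, sequence) pairs into n_chunks lists."""
    return [targets[i::n_chunks] for i in range(n_chunks)]


def shared_kmers(a, b, k=3):
    """Toy similarity score: number of distinct k-mers both sequences contain."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def search_chunk(query, chunk):
    """Score the query against every target in one chunk."""
    return [(name, shared_kmers(query, seq)) for name, seq in chunk]


def distributed_search(query, targets, n_chunks=2, top=3):
    """Search each chunk independently, then merge and rank the hits."""
    hits = []
    for chunk in split_into_chunks(targets, n_chunks):
        # Assumption: in a real deployment each chunk would be searched on
        # its own server, which only needs memory for that chunk.
        hits.extend(search_chunk(query, chunk))
    hits = [hit for hit in hits if hit[1] > 0]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))[:top]


targets = [
    ("t1", "GGMKTAYIAKQRGG"),
    ("t2", "WWWWWWWW"),
    ("t3", "MKTAYCCC"),
]
print(distributed_search("MKTAYIAKQR", targets))  # → [('t1', 8), ('t3', 3)]
```

Because every chunk is scored independently and the hits are merged afterwards, the ranked result does not depend on how many chunks the database is split into.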