PLMSearch: Protein language model powers accurate and fast sequence search for remote homology

30 March 2024 | Wei Liu, Ziyi Wang, Ronghui You, Chenghan Xie, Hong Wei, Yi Xiong, Jianyi Yang & Shanfeng Zhu
PLMSearch is a protein homology search method that uses a pre-trained protein language model to detect remote homology from sequence data alone. It outperforms traditional sequence search methods by combining deep sequence embeddings with a similarity prediction model trained on structural similarity data. Because the language model captures remote homology information in its embeddings, PLMSearch achieves high sensitivity without requiring structures as input. It can search millions of protein pairs in seconds, with threefold higher sensitivity than existing sequence search methods, and its accuracy is comparable to state-of-the-art structure-based search methods. It is particularly effective at identifying remote homology pairs that have dissimilar sequences but similar structures, which traditional sequence-based methods struggle to detect.
PLMSearch is freely available and includes a pipeline for sequence filtering, similarity prediction, and alignment. This combination of speed and accuracy makes it well suited to large-scale protein sequence searches.
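The general idea of embedding-based search can be illustrated with a minimal sketch. This is not the PLMSearch implementation: random vectors stand in for protein language model embeddings, plain cosine similarity stands in for the trained similarity predictor, and the function names `cosine_similarity_matrix` and `search` are hypothetical. It only shows how fixed-length embeddings allow many query-target pairs to be scored with a single matrix product and then ranked.

```python
import numpy as np


def cosine_similarity_matrix(queries, targets):
    """Return an (n_queries, n_targets) matrix of cosine similarities."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    return q @ t.T


def search(queries, targets, top_k=3, min_score=0.0):
    """For each query, return up to top_k (target_index, score) pairs.

    Assumption: cosine similarity is a placeholder for a learned
    similarity predictor; min_score is a hypothetical filtering threshold.
    """
    scores = cosine_similarity_matrix(queries, targets)
    hits = []
    for row in scores:
        order = np.argsort(-row)[:top_k]  # best-scoring targets first
        hits.append([(int(j), float(row[j])) for j in order if row[j] >= min_score])
    return hits


# Toy data: random vectors standing in for per-protein embeddings.
rng = np.random.default_rng(0)
targets = rng.normal(size=(1000, 64))

# Each query is a noisy copy of a known target (indices 42 and 7),
# so those targets should rank first for their respective queries.
queries = targets[[42, 7]] + 0.1 * rng.normal(size=(2, 64))

hits = search(queries, targets, top_k=3)
for query_index, query_hits in enumerate(hits):
    print(query_index, query_hits)
```

In a real setting the embeddings would come from a protein language model, and the top-ranked pairs would then be passed on to an alignment step.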