[slides] ProLLaMA: A Protein Language Model for Multi-Task Protein Language Processing

ProLLaMA is a protein language model designed for multi-task protein language processing. Unlike existing Protein Language Models (PLMs), it handles both protein sequence generation and protein understanding tasks. The model is built by transforming a general Large Language Model (LLM) into a PLM through a two-stage training framework: the first stage pre-trains the LLM on a protein language dataset, and the second stage fine-tunes it on a multi-task instruction dataset. To improve training efficiency, the authors introduce Protein Vocabulary Pruning (PVP), which reduces the model's vocabulary size and parameter count.

ProLLaMA achieves state-of-the-art results in protein sequence generation and performs strongly in protein superfamily prediction, with a 62% exact match rate. It can generate novel proteins with desired functionalities and follow user instructions to perform multiple tasks. The results show that ProLLaMA outperforms existing PLMs on both protein generation and understanding tasks. The model is available at https://github.com/PKU-YuanGroup/ProLLaMA and https://huggingface.co/GreatCaptainNemo.
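
The summary says only that Protein Vocabulary Pruning shrinks the vocabulary and the parameter count. It does not describe the procedure. The sketch below is one plausible reading, not the authors' implementation: keep only the tokens a protein model needs, renumber them, and drop the embedding rows of everything else. The function name `prune_vocabulary`, the toy vocabulary, and the choice of kept tokens (the 20 standard amino-acid letters plus two special tokens) are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# The 20 standard amino-acid one-letter codes (general knowledge, not taken from the summary).
AMINO_ACIDS: Set[str] = set("ACDEFGHIKLMNPQRSTVWY")


def prune_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    keep_tokens: Set[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Hypothetical helper: keep only `keep_tokens` and the matching embedding rows.

    `vocab` maps token -> row index in `embeddings`. The result uses
    contiguous ids starting at 0, in the original id order.
    """
    # Preserve the original relative order of the surviving tokens.
    kept = sorted((tok for tok in vocab if tok in keep_tokens), key=lambda t: vocab[t])
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    # Copy only the surviving rows; fewer rows means fewer embedding parameters.
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


if __name__ == "__main__":
    # Toy general-purpose vocabulary with a 2-dimensional embedding per token.
    toy_vocab = {"<s>": 0, "</s>": 1, "the": 2, "A": 3, "protein": 4, "C": 5, "D": 6}
    toy_embeddings = [[float(i), float(i) + 0.5] for i in range(len(toy_vocab))]

    keep = AMINO_ACIDS | {"<s>", "</s>"}
    new_vocab, new_embeddings = prune_vocabulary(toy_vocab, toy_embeddings, keep)
    print(new_vocab)
    print(len(toy_embeddings), "->", len(new_embeddings), "embedding rows")
```

In a real model the output projection layer would need the same row selection, and the tokenizer would have to be rebuilt to match the new ids. Both steps are omitted here.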

16 Jul 2024 | Liuzhenghao Lv, Zongying Lin, Hao Li, Yuyang Liu, Jiaxi Cui, Calvin Yu-Chian Chen, Li Yuan, Yonghong Tian