PROTLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training


28 Feb 2024 | Le Zhuo, Zewen Chi, Minghao Xu, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, Wentao Zhang
PROTLLM is a versatile cross-modal large language model (LLM) designed for both protein-centric and protein-language tasks. It features a dynamic protein mounting mechanism that allows it to handle complex inputs in which natural language text is interspersed with an arbitrary number of proteins. In addition, PROTLLM employs a protein-as-word language modeling approach and is pre-trained on InterPT, a large-scale interleaved protein-text dataset that combines structured sources, such as protein annotations, with unstructured sources, such as biological research papers. This dataset enables PROTLLM to learn the knowledge needed to understand proteins.

The model's architecture comprises an LLM for natural language modeling, a protein encoder, and cross-modal connectors that allow it to accept multimodal inputs. The dynamic protein mounting mechanism lets PROTLLM process text interspersed with proteins, so it can handle diverse downstream tasks without re-designing task-specific architectures.

PROTLLM is evaluated on classic supervised protein-centric tasks and explored on novel protein-language applications such as protein-protein interaction prediction and text-guided functional protein retrieval. Experimental results show that PROTLLM outperforms protein-specialized baselines on protein-centric tasks and demonstrates zero-shot and in-context learning capabilities on protein-language tasks, validating its effectiveness in both settings.
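To make the two core ideas concrete, below is a minimal, hypothetical sketch of dynamic protein mounting and protein-as-word prediction. All names (`ProtLLMSketch`, `protein_token_id`, the stand-in encoder and backbone, and the dimensions) are illustrative assumptions, not the paper's actual implementation: protein embeddings are spliced into the text embedding sequence at placeholder positions, and the output head scores a protein vocabulary alongside the text vocabulary.

```python
import torch
import torch.nn as nn

class ProtLLMSketch(nn.Module):
    """Illustrative sketch (not the official ProtLLM code).

    - A protein encoder maps each protein to a vector (here a toy linear layer).
    - A cross-modal connector projects protein vectors into the LLM embedding space.
    - Dynamic protein mounting: protein embeddings are spliced into the text
      embedding sequence wherever a <protein> placeholder token occurs.
    - Protein-as-word: the output head scores text tokens and a protein
      vocabulary jointly, so a protein can be predicted like an ordinary word.
    """

    def __init__(self, vocab_size=1000, n_proteins=500, d_text=256, d_prot=128,
                 protein_token_id=999):
        super().__init__()
        self.protein_token_id = protein_token_id
        self.tok_emb = nn.Embedding(vocab_size, d_text)
        self.protein_encoder = nn.Linear(d_prot, d_prot)   # stand-in for a real protein encoder
        self.connector = nn.Linear(d_prot, d_text)          # cross-modal connector
        self.backbone = nn.TransformerEncoder(              # stand-in for the LLM backbone
            nn.TransformerEncoderLayer(d_text, nhead=4, batch_first=True),
            num_layers=2)
        self.text_head = nn.Linear(d_text, vocab_size)
        self.protein_head = nn.Linear(d_text, n_proteins)   # protein-as-word vocabulary

    def forward(self, token_ids, protein_feats):
        """token_ids: (seq,); protein_feats: (num_proteins_in_input, d_prot).

        Proteins are consumed left to right, one per <protein> placeholder.
        """
        text_emb = self.tok_emb(token_ids)                              # (seq, d_text)
        prot_emb = self.connector(self.protein_encoder(protein_feats))  # (n, d_text)
        # Dynamic mounting: replace each placeholder embedding with a protein embedding.
        mounted, p = [], 0
        for i, tid in enumerate(token_ids.tolist()):
            if tid == self.protein_token_id:
                mounted.append(prot_emb[p]); p += 1
            else:
                mounted.append(text_emb[i])
        h = self.backbone(torch.stack(mounted).unsqueeze(0)).squeeze(0)
        # Protein-as-word: one prediction space over text tokens and protein entries.
        return torch.cat([self.text_head(h), self.protein_head(h)], dim=-1)


# Usage: "... <protein> interacts with <protein> ..." with two mounted proteins.
model = ProtLLMSketch()
tokens = torch.tensor([1, 2, 999, 3, 999, 4])
proteins = torch.randn(2, 128)
logits = model(tokens, proteins)
print(logits.shape)  # torch.Size([6, 1500]) -> vocab_size + n_proteins
```

Because an arbitrary number of proteins can be mounted at arbitrary positions, the same input format covers tasks such as protein-protein interaction prediction (two mounted proteins, answer in text) and text-guided protein retrieval (text query, protein predicted via the protein-as-word head).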