PROTLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training

28 Feb 2024 | Le Zhuo, Zewen Chi, Minghao Xu, Heyan Huang, Heqi Zheng, Conghui He, Xian-Ling Mao, Wentao Zhang
PROTLLM is a versatile cross-modal large language model (LLM) designed for both protein-centric and protein-language tasks. It features a dynamic protein mounting mechanism that lets it handle complex inputs in which an arbitrary number of proteins is interleaved with natural language text. The model is trained with a protein-as-word language modeling objective: at each step it predicts either the next natural-language word or the next protein, drawn from a specialized protein vocabulary. For pre-training, the authors construct InterPT, a large-scale interleaved protein-text dataset spanning structured sources such as protein annotations and unstructured sources such as biological research papers. Experiments show that PROTLLM outperforms specialized baselines on protein-centric tasks and exhibits zero-shot and in-context learning capabilities on protein-language tasks. Its ability to handle complex interleaved inputs, combined with the breadth of its pre-training data, makes it a powerful tool for advancing bioscience research and AI applications in protein understanding.
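The two core ideas, dynamic protein mounting and protein-as-word prediction, can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the class name `ProteinAsWordLM`, the tiny Transformer backbone, and the learned `prot_emb` table (a stand-in for the real protein encoder whose outputs PROTLLM mounts into the text stream) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ProteinAsWordLM(nn.Module):
    """Toy decoder that interleaves protein vectors with word embeddings."""
    def __init__(self, text_vocab: int, protein_vocab: int, d_model: int = 256):
        super().__init__()
        self.tok_emb = nn.Embedding(text_vocab, d_model)
        # Stand-in for pooled outputs of a real protein encoder.
        self.prot_emb = nn.Embedding(protein_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Two heads: next-word prediction and next-protein prediction
        # over the specialized protein vocabulary.
        self.word_head = nn.Linear(d_model, text_vocab)
        self.protein_head = nn.Linear(d_model, protein_vocab)

    def forward(self, token_ids, protein_ids, is_protein):
        # "Dynamic protein mounting": wherever is_protein is True, splice a
        # protein representation into the otherwise ordinary token stream,
        # so any number of proteins can be interleaved with the text.
        x = torch.where(
            is_protein.unsqueeze(-1),
            self.prot_emb(protein_ids),
            self.tok_emb(token_ids),
        )
        seq_len = x.size(1)
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        h = self.backbone(x, mask=causal)
        # Each position predicts either the next word or the next protein;
        # the training loss would use whichever head matches the target.
        return self.word_head(h), self.protein_head(h)

# Toy usage: a 5-position sequence with one protein mounted at position 2.
model = ProteinAsWordLM(text_vocab=1000, protein_vocab=50)
tokens = torch.randint(0, 1000, (1, 5))
proteins = torch.zeros(1, 5, dtype=torch.long)
proteins[0, 2] = 7  # protein index 7 from the protein vocabulary
mounted = torch.tensor([[False, False, True, False, False]])
word_logits, protein_logits = model(tokens, proteins, mounted)
print(word_logits.shape, protein_logits.shape)  # (1, 5, 1000) (1, 5, 50)
```

In this sketch the mounting step is just an embedding swap at marked positions; the actual model replaces those positions with encoder representations of the protein sequences themselves, which is what allows arbitrarily many proteins per input.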