ProtChatGPT: Towards Understanding Proteins with Large Language Models

15 Feb 2024 | Chao Wang, Hehe Fan, Ruijie Quan, Yi Yang
Protein research is crucial to many fundamental disciplines, but understanding proteins' intricate structure-function relationships remains challenging. Recent Large Language Models (LLMs) have shown significant progress in comprehending task-specific knowledge, suggesting the potential for ChatGPT-like systems specialized in proteins to facilitate basic research. This work introduces ProtChatGPT, an AI-based protein chat system designed to learn and understand protein structures via natural language. Users can upload proteins, ask questions, and engage in interactive conversations to receive comprehensive answers. The system comprises protein encoders, a Protein-Language Pre-training Transformer (PLP-former), a projection adapter, and an LLM. An uploaded protein first passes through the protein encoders and the PLP-former to produce protein embeddings, which the adapter then projects to conform to the LLM's input space. The LLM combines user questions with the projected embeddings to generate informative answers. Experiments demonstrate that ProtChatGPT can produce promising responses to proteins and their corresponding questions, showing its potential for further exploration and application in protein research. The code and pre-trained model will be publicly available.
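To make the described pipeline concrete, the following is a minimal sketch of the protein-embedding path (encoder features → PLP-former → projection adapter → LLM soft prompt). The abstract does not specify module internals, so all class names, dimensions, and the query-token design are assumptions, loosely modeled on Q-Former-style adapters; this is illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- not specified in the abstract.
PROT_DIM = 1280   # assumed width of the protein encoder's output features
PLP_DIM = 768     # assumed hidden width of the PLP-former
LLM_DIM = 4096    # assumed hidden size of the downstream LLM

class PLPFormer(nn.Module):
    """Stand-in for the Protein-Language Pre-training Transformer:
    learnable query tokens cross-attend to protein encoder features
    and summarize them into a fixed number of embeddings."""
    def __init__(self, num_queries=32, dim=PLP_DIM, prot_dim=PROT_DIM):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.proj_in = nn.Linear(prot_dim, dim)  # align encoder width to PLP width
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, prot_feats):               # (B, L, PROT_DIM)
        kv = self.proj_in(prot_feats)
        q = self.queries.expand(prot_feats.size(0), -1, -1)
        x, _ = self.cross_attn(q, kv, kv)        # queries attend to protein features
        return x + self.ffn(x)                   # (B, num_queries, PLP_DIM)

class ProjectionAdapter(nn.Module):
    """Linear adapter mapping PLP-former outputs into the LLM's embedding space."""
    def __init__(self, dim=PLP_DIM, llm_dim=LLM_DIM):
        super().__init__()
        self.proj = nn.Linear(dim, llm_dim)

    def forward(self, x):
        return self.proj(x)                      # (B, num_queries, LLM_DIM)

# Usage: protein embeddings become soft-prompt tokens prepended to the
# embedded user question before being fed to the LLM.
prot_feats = torch.randn(1, 200, PROT_DIM)       # placeholder encoder output
soft_prompt = ProjectionAdapter()(PLPFormer()(prot_feats))  # (1, 32, LLM_DIM)
question_emb = torch.randn(1, 16, LLM_DIM)       # placeholder embedded question
llm_inputs = torch.cat([soft_prompt, question_emb], dim=1)
```

The fixed set of query tokens decouples the LLM's prompt length from the protein's sequence length, which is one common motivation for this adapter design; whether ProtChatGPT uses exactly this mechanism is not stated in the abstract.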