15 Feb 2024 | Chao Wang, Hehe Fan, Ruijie Quan, Yi Yang
ProtChatGPT is an AI-based system that brings large language models (LLMs) to protein research: users upload protein sequences or structures and ask questions, and ProtChatGPT generates comprehensive answers based on the information provided. Architecturally, it combines protein encoders, a Protein-Language Pre-training Transformer (PLP-former), a projection adapter, and an LLM. The encoders produce embeddings for sequences and structures, the PLP-former aligns those embeddings with language representations, and the projection adapter converts them into prompts the LLM can interpret, allowing the system to generate informative answers.
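The encoder → PLP-former → adapter → LLM pipeline can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the dimensions, function names, and the simple cross-attention aggregation are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): per-residue embedding size,
# number of learnable query tokens, and the LLM's hidden size.
D_PROT, N_QUERY, D_LLM = 64, 8, 128

def protein_encoder(sequence: str) -> np.ndarray:
    """Stand-in for a frozen protein encoder: one D_PROT-dim embedding per
    residue (a real system would use a pretrained sequence/structure model)."""
    return rng.standard_normal((len(sequence), D_PROT))

def plp_former(residue_emb: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Toy PLP-former: learnable queries cross-attend to residue embeddings
    and distill them into a fixed number of protein tokens."""
    scores = queries @ residue_emb.T / np.sqrt(D_PROT)      # (N_QUERY, L)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)                # softmax over residues
    return attn @ residue_emb                               # (N_QUERY, D_PROT)

def projection_adapter(tokens: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Linear adapter mapping protein tokens into the LLM's prompt space."""
    return tokens @ W                                       # (N_QUERY, D_LLM)

queries = rng.standard_normal((N_QUERY, D_PROT))
W_proj = rng.standard_normal((D_PROT, D_LLM))

residues = protein_encoder("MKTAYIAKQR")
soft_prompt = projection_adapter(plp_former(residues, queries), W_proj)
print(soft_prompt.shape)  # (8, 128)
```

The resulting soft-prompt tokens would be prepended to the user's question before it reaches the LLM; in ProtChatGPT the encoders and LLM stay frozen while the PLP-former and adapter are trained.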
Training follows a two-stage approach: the first stage learns protein-description representations by training the PLP-former on protein-description pairs to extract language-relevant features; the second stage uses the PLP-former together with a frozen 3D encoder to generate text from protein data. Evaluated on tasks such as protein understanding and design, the system provides accurate and logically consistent answers, and it also shows promise in distinguishing homologous proteins and proteins with mutually exclusive functions, highlighting its potential in protein research.
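A common way to implement the first stage's protein-description representation learning is a symmetric contrastive (InfoNCE-style) objective over a batch of matched pairs. The sketch below illustrates that idea only; the paper's exact stage-1 objectives and the temperature value used here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def contrastive_loss(protein_feats: np.ndarray,
                     text_feats: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch: the i-th protein embedding should be
    most similar to the i-th description embedding (diagonal of the matrix)."""
    p = protein_feats / np.linalg.norm(protein_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = p @ t.T / temperature                 # (B, B) cosine similarities

    def cross_entropy(l: np.ndarray) -> float:
        # Numerically stable log-softmax per row; targets are the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return float(-logp[idx, idx].mean())

    # Average protein-to-text and text-to-protein directions.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

B, D = 4, 32  # toy batch of 4 protein-description pairs
loss = contrastive_loss(rng.standard_normal((B, D)), rng.standard_normal((B, D)))
print(loss)
```

Minimizing this loss pulls each protein's features toward its own description and away from the other descriptions in the batch, which is what lets the PLP-former extract features the language side can use.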
ProtChatGPT is designed to be a versatile tool for protein research, offering a platform for interactive dialogue and comprehensive insights into protein structures and functions. The system's ability to align multi-level protein features with LLMs makes it a valuable resource for researchers in the field of biomedicine and beyond. The system is expected to contribute to further advancements in protein research and inspire applications in other scientific disciplines.