21 May 2024 | Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua
ProtT3 is a framework for protein-to-text generation aimed at enhancing text-based protein understanding. It couples a Protein Language Model (PLM) with a Language Model (LM) so that the LM can take protein sequences as input, enabling effective protein-to-text generation. A cross-modal projector, the Q-Former, bridges the modality gap between the PLM's representation space and the LM's input space. ProtT3 addresses a limitation of existing methods by focusing on protein-to-text generation, a direction that has been underexplored.

Training proceeds in two stages: protein-text retrieval and protein-to-text generation. The first stage uses three cross-modal tasks to strengthen protein-text modeling; the second trains the LM to generate text conditioned on protein inputs. ProtT3 achieves state-of-the-art performance on protein captioning, protein question answering, and protein-text retrieval, outperforming existing baselines by significant margins, and ablations confirm the effectiveness of its core components. Evaluations on Swiss-Prot, ProteinKG25, and PDB-QA show substantial improvements in metrics such as BLEU-2 and retrieval accuracy. ProtT3 also incorporates LoRA adapters for parameter-efficient fine-tuning, enabling adaptation to new tasks with minimal memory overhead. The paper details the model architecture and training methodology, highlighting contributions to protein-text modeling and multi-modal language understanding.
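To make the projector concrete, here is a minimal PyTorch sketch of a Q-Former-style cross-modal projector: a fixed set of learnable query tokens cross-attends to per-residue PLM embeddings, and the resulting fixed-length outputs are projected into the LM's input space as "soft tokens". All module names, dimensions, and the single-block structure are illustrative assumptions; the paper's Q-Former, following the BLIP-2 design, is a multi-layer transformer.

```python
# Sketch of a Q-Former-style projector (assumption: simplified to one
# cross-attention block; dimensions are placeholders, not the paper's).
import torch
import torch.nn as nn

class QFormerProjector(nn.Module):
    def __init__(self, plm_dim=1280, lm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        # Learnable query tokens that "read" the protein representation.
        self.queries = nn.Parameter(torch.randn(n_queries, plm_dim) * 0.02)
        # Cross-attention: queries attend to per-residue PLM embeddings.
        self.cross_attn = nn.MultiheadAttention(plm_dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(plm_dim), nn.Linear(plm_dim, 4 * plm_dim),
            nn.GELU(), nn.Linear(4 * plm_dim, plm_dim),
        )
        # Project the fixed-length query outputs into the LM's input space.
        self.to_lm = nn.Linear(plm_dim, lm_dim)

    def forward(self, plm_states, padding_mask=None):
        # plm_states: (batch, seq_len, plm_dim) residue-level embeddings.
        q = self.queries.unsqueeze(0).expand(plm_states.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, plm_states, plm_states,
                                      key_padding_mask=padding_mask)
        h = attn_out + self.ffn(attn_out)
        # Result: n_queries "soft tokens" the LM can consume as a prefix.
        return self.to_lm(h)
```

Because the projector emits a fixed number of query outputs regardless of protein length, the LM sees a constant-size protein prefix, which keeps generation cost independent of sequence length.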
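The summary does not name the three first-stage cross-modal tasks. One standard ingredient of retrieval-oriented training is a symmetric contrastive (InfoNCE) loss over paired protein and text embeddings; the sketch below, with hypothetical function and argument names, shows that loss as one plausible component, not necessarily the paper's exact objective.

```python
# Hypothetical protein-text contrastive loss (assumption: InfoNCE-style
# objective; the paper's stage-1 losses may differ).
import torch
import torch.nn.functional as F

def protein_text_contrastive(prot_emb, text_emb, temperature=0.07):
    # prot_emb, text_emb: (batch, dim) pooled representations; rows are
    # assumed aligned (i-th protein matches i-th caption).
    prot = F.normalize(prot_emb, dim=-1)
    text = F.normalize(text_emb, dim=-1)
    logits = prot @ text.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the protein-to-text and text-to-protein directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```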
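Attaching LoRA adapters with Hugging Face's peft library looks roughly like the following; the base model name, rank, and target modules are illustrative assumptions rather than the paper's configuration.

```python
# Sketch of parameter-efficient fine-tuning with LoRA adapters
# (assumption: model name and hyperparameters are placeholders).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=8,                    # low-rank update dimension
    lora_alpha=16,          # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attach to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_lm, lora_cfg)
model.print_trainable_parameters()  # only the small adapter weights train
```

Only the low-rank adapter matrices receive gradients, which is what allows the LM to be adapted to protein-conditioned generation with a small fraction of the memory a full fine-tune would need.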