21 May 2024 | Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua
ProtT3: Protein-to-Text Generation for Text-based Protein Understanding
**Authors:** Zhiyuan Liu, An Zhang, Hao Fei, Enzhi Zhang, Xiang Wang, Kenji Kawaguchi, Tat-Seng Chua
**Institutions:** National University of Singapore, University of Science and Technology of China
**Abstract:**
Language Models (LMs) excel in understanding textual descriptions of proteins, but struggle with raw protein data like amino acid sequences due to limited pretraining on such data. Conversely, Protein Language Models (PLMs) can understand and convert protein data into high-quality representations but lack the ability to process texts effectively. To address these limitations, we introduce ProtT3, a framework for Protein-to-Text Generation for Text-based Protein Understanding. ProtT3 integrates a PLM as a protein understanding module into an LM, enabling effective protein-to-text generation. This integration is facilitated by a cross-modal projector (Q-Former) that bridges the modality gap between the PLM's representation space and the LM's input space. Unlike previous studies focusing on protein property prediction and protein-text retrieval, we explore the unexplored field of protein-to-text generation. We establish quantitative evaluations for protein-text modeling tasks, including protein captioning, protein question-answering, and protein-text retrieval. Our experiments show that ProtT3 significantly outperforms current baselines, with ablation studies highlighting the efficacy of its core components.
**Contributions:**
- ProtT3: A new framework that integrates a PLM into an LM to enhance the LM's protein understanding ability, enabling effective protein-to-text generation.
- Establishes quantitative evaluations for protein-text modeling tasks, including protein captioning, protein QA, and protein-text retrieval.
- Achieves state-of-the-art performances across various tasks, surpassing baselines by significant margins.
**Related Work:**
- PLMs: Specialized LMs pretrained on protein sequences for protein understanding and generation.
- Protein-Text Modeling: Previous studies focus on protein property prediction and protein-text retrieval, lacking exploration in protein-to-text generation.
- Multi-modal LMs: Research on enabling LMs to understand other modalities like images, videos, and molecules.
**Model Architecture:**
- **Protein Language Model (PLM):** ESM-2, an encoder-only transformer LM pretrained on large corpora of protein sequences.
- **Language Model (LM):** Galactica, a decoder-only transformer LM pretrained on scientific papers.
- **Cross-modal Projector (Q-Former):** Aims to bridge the modality gap between the PLM and LM, enabling effective protein-to-text generation.
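The data flow through these three components can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: all dimensions are toy placeholders (the real ESM-2 and Galactica hidden sizes are not given here), and the Q-Former is reduced to a single cross-attention step with random weights.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_plm = 128, 64      # protein length, PLM hidden size (illustrative)
n_query, d_q = 8, 32    # number of learned Q-Former query tokens, query dim
d_lm = 48               # LM embedding size (illustrative)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: learned queries attend over PLM outputs."""
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V

# 1) PLM (ESM-2) encodes the amino-acid sequence into per-residue states.
protein_states = rng.standard_normal((L, d_plm))

# 2) Q-Former: a fixed set of learned query tokens distills the variable-length
#    protein representation into n_query vectors via cross-attention.
queries = rng.standard_normal((n_query, d_q))
Wq = rng.standard_normal((d_q, d_q))
Wk = rng.standard_normal((d_plm, d_q))
Wv = rng.standard_normal((d_plm, d_q))
distilled = cross_attention(queries, protein_states, Wq, Wk, Wv)  # (n_query, d_q)

# 3) A linear projection maps the distilled queries into the LM's input
#    embedding space; they are prepended to the text embeddings as soft prompts.
W_proj = rng.standard_normal((d_q, d_lm))
soft_prompts = distilled @ W_proj                 # (n_query, d_lm)
text_embeds = rng.standard_normal((10, d_lm))     # embedded text tokens
lm_input = np.concatenate([soft_prompts, text_embeds], axis=0)
print(lm_input.shape)  # (18, 48)
```

The key design point this sketch captures is that the Q-Former always emits a fixed number of query vectors regardless of protein length, so the LM sees a constant-size protein prefix.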
**Training Method:**
- **Stage 1: Protein-Text Retrieval Training:** Employs three cross-modal tasks (protein-text contrasting, protein-text matching, and protein captioning) to train the Q-Former.
- **Stage 2: Protein-to-Text Generation Training:** Feeds the Q-Former's projected query outputs into the LM as soft prompts and trains the model end-to-end for protein-to-text generation (protein captioning and protein QA) with a language-modeling loss.
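Of the three Stage 1 tasks, the protein-text contrasting objective can be sketched as a symmetric InfoNCE loss over a batch of matched protein/text pairs. This is a hedged numpy sketch, not the paper's code: batch size, embedding dimension, and the temperature value are illustrative, and the protein embeddings stand in for pooled Q-Former query outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def contrastive_loss(protein_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (protein_i, text_i) pairs are positives,
    all other in-batch pairs are negatives."""
    p = l2_normalize(protein_emb)
    t = l2_normalize(text_emb)
    logits = p @ t.T / temperature        # (B, B) cosine-similarity matrix
    labels = np.arange(len(p))            # positives lie on the diagonal

    def xent(lg):
        # row-wise cross-entropy against the diagonal labels
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[np.arange(len(lg)), labels].mean()

    # average the protein-to-text and text-to-protein directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
B, d = 4, 16
protein_emb = rng.standard_normal((B, d))
# simulate matched captions as noisy copies of their protein embedding
text_emb = protein_emb + 0.1 * rng.standard_normal((B, d))
loss_matched = contrastive_loss(protein_emb, text_emb)
loss_shuffled = contrastive_loss(protein_emb, text_emb[::-1].copy())
print(loss_matched, loss_shuffled)
```

As a sanity check, the loss is much lower when each protein is paired with its own caption than with a shuffled one, which is exactly the signal the retrieval stage optimizes; the matching and captioning tasks of Stage 1 are omitted here.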