ENDOWING PROTEIN LANGUAGE MODELS WITH STRUCTURAL KNOWLEDGE

January 29, 2024 | Dexiong Chen, Philip Hartout, Paolo Pellizzoni, Carlos Oliver, Karsten Borgwardt
This paper introduces the Protein Structure Transformer (PST), a framework that enhances protein language models (PLMs) by integrating protein structural data. PST is built on ESM-2, a state-of-the-art sequence-based PLM, by inserting structure extractor modules within its self-attention blocks. These modules allow PST to exploit structural information and to be pretrained on a protein structure database such as AlphaFoldDB with the same masked language modeling objective used by traditional PLMs.

PST is pretrained on a dataset of 542,378 predicted protein structures and is available at https://github.com/BorgwardtLab/PST. Its architecture supports both extracting protein structure representations and fine-tuning for specific downstream applications. Compared to ESM-2, PST is more parameter-efficient and performs better, particularly on protein function prediction tasks such as enzyme and gene ontology classification.

The model is evaluated on a range of function and structure prediction tasks, where it outperforms existing models. The study also examines the impact of different pretraining strategies and how much structural information is needed to refine ESM-2 models. The results show that PST achieves state-of-the-art performance in protein function prediction and that structure-aware models are effective at discerning protein structural variations. The paper concludes that PST represents a significant advance in protein representation learning, highlighting the value of combining sequence and structural data and offering new insights into the relationship between protein sequence, structure, and function.
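To make the architectural idea concrete, here is a minimal PyTorch sketch of a structure-aware self-attention block in the spirit of PST. It is not the authors' implementation (see the repository above): the StructureExtractor module, the 8 Å contact-graph construction, and the choice to add structural features to the residue embeddings before the query/key/value projections are illustrative assumptions for this sketch only.

```python
# Minimal sketch of the PST idea: inject a "structure extractor" into a
# transformer self-attention block. Illustrative only; not the code from
# https://github.com/BorgwardtLab/PST.

import torch
import torch.nn as nn


class StructureExtractor(nn.Module):
    """Toy graph neural network over a residue contact graph: each residue
    averages features from residues it contacts and mixes them with its own."""

    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, residues, dim); adj: (batch, residues, residues) with 0/1 contacts
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        for layer in self.layers:
            neighbors = adj @ x / deg      # mean over contacting residues
            x = layer(x + neighbors)       # combine self and neighborhood features
        return x


class StructureAwareAttention(nn.Module):
    """Self-attention block whose inputs are enriched with structural features,
    mimicking how PST places structure extractors inside ESM-2 attention blocks."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.extractor = StructureExtractor(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Assumption: structural features are added to the residue embeddings
        # before the query/key/value projections of the attention layer.
        h = self.norm(tokens + self.extractor(tokens, adj))
        out, _ = self.attn(h, h, h, need_weights=False)
        return tokens + out


if __name__ == "__main__":
    batch, length, dim = 2, 128, 320
    tokens = torch.randn(batch, length, dim)      # residue embeddings from a PLM
    coords = torch.randn(batch, length, 3) * 10   # placeholder C-alpha coordinates
    adj = (torch.cdist(coords, coords) < 8.0).float()  # assumed 8 Å contact graph
    block = StructureAwareAttention(dim)
    print(block(tokens, adj).shape)               # torch.Size([2, 128, 320])
```

Because the structural signal enters as an additive update inside each block, the surrounding PLM weights can be initialized from a pretrained sequence model and refined with the usual masked language modeling loss, which is the general strategy the paper follows.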