DeProt: Protein language modeling with quantized structure and disentangled attention

April 17, 2024 | Mingchen Li, Yang Tan, Bozitao Zhong, Ziyi Zhou, Huiqun Yu, Xinzhu Ma, Wanli Ouyang, Liang Hong, Bingxin Zhou, Pan Tan
DeProt is a Transformer-based protein language model designed to incorporate both protein sequences and structures. It addresses the critical shortcoming of traditional protein language models, which lack explicit protein structure information, by pre-training on millions of protein structures from diverse natural protein clusters. DeProt serializes protein structures into residue-level local-structure sequences using a graph neural network-based auto-encoder and quantizes these vectors into discrete structure tokens using a pre-trained codebook. The model employs disentangled attention mechanisms to integrate residue sequences with structure token sequences, effectively capturing the relationship between protein sequences and their functionality. Despite having fewer parameters and less training data, DeProt outperforms other state-of-the-art protein language models, including those that are structure-aware and evolution-based, particularly in zero-shot mutant effect prediction tasks. Experimental results demonstrate that DeProt exhibits robust representational capabilities across various supervised-learning downstream tasks, highlighting its innovative framework and superior performance. The code, model weights, and associated datasets are available at: https://github.com/gimmn/DeProt.
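To make the two central ideas concrete, here is a minimal sketch (not the authors' code) of (1) quantizing per-residue local-structure embeddings against a pre-trained codebook, and (2) a disentangled attention score that mixes residue-token and structure-token interaction terms. All tensor shapes, function names, and the exact score decomposition are assumptions for illustration; the actual implementation is in the DeProt repository linked above.

```python
import torch
import torch.nn.functional as F


def quantize_structure(local_embeddings: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each residue's local-structure embedding to the index of its nearest
    codebook vector (standard vector quantization by L2 distance).

    local_embeddings: (num_residues, d) output of the GNN auto-encoder
    codebook:         (codebook_size, d) pre-trained structure codebook
    returns:          (num_residues,) discrete structure tokens
    """
    dists = torch.cdist(local_embeddings, codebook)   # (num_residues, codebook_size)
    return dists.argmin(dim=-1)                       # nearest-neighbor token ids


def disentangled_scores(res_q, res_k, struct_q, struct_k):
    """One plausible form of disentangled attention: keep separate projections for
    residue tokens and structure tokens and sum the pairwise interaction terms
    (residue-residue, residue-structure, structure-residue) before the softmax.
    The exact set of terms and their scaling in DeProt may differ."""
    d = res_q.size(-1)
    scores = (
        res_q @ res_k.transpose(-1, -2)          # sequence-to-sequence term
        + res_q @ struct_k.transpose(-1, -2)     # sequence-to-structure term
        + struct_q @ res_k.transpose(-1, -2)     # structure-to-sequence term
    ) / (3 * d) ** 0.5
    return F.softmax(scores, dim=-1)


# Toy usage with made-up sizes: 8 residues, embedding dim 16, codebook of 32 entries.
emb = torch.randn(8, 16)
codebook = torch.randn(32, 16)
struct_tokens = quantize_structure(emb, codebook)    # (8,) discrete structure tokens
attn = disentangled_scores(torch.randn(8, 16), torch.randn(8, 16),
                           torch.randn(8, 16), torch.randn(8, 16))
print(struct_tokens.shape, attn.shape)               # torch.Size([8]) torch.Size([8, 8])
```

The design intuition, as described in the abstract, is that discrete structure tokens let the Transformer consume structure the same way it consumes amino-acid tokens, while the disentangled score keeps the sequence and structure contributions separable rather than collapsing them into a single summed embedding.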