DeProt: Protein language modeling with quantized structure and disentangled attention

April 17, 2024 | Mingchen Li, Yang Tan, Bozitao Zhong, Ziyi Zhou, Huiqun Yu, Xinzhu Ma, Wanli Ouyang, Liang Hong, Bingxin Zhou, Pan Tan
DeProt is a protein language model that integrates protein sequence and structure information in a Transformer-based architecture. It is pre-trained on 18 million protein structures from AlphaFoldDB, incorporating structural information through structure quantization and disentangled attention.

The model first encodes each protein structure into a sequence of per-residue local-structure embeddings using a graph neural network (GNN)-based auto-encoder, then quantizes these embeddings into discrete structure tokens using a pre-trained codebook. Disentangled attention then integrates the residue sequence with the structure-token sequence, enabling the model to capture the relationships among primary sequence, three-dimensional structure, and residue positions. Two minimal sketches of these mechanisms follow.
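To make the quantization step concrete, below is a minimal sketch of the codebook lookup, assuming the GNN auto-encoder has already produced one local-structure embedding per residue. The function and variable names are illustrative, not taken from the DeProt codebase; the only detail drawn from the paper is the structure vocabulary size K = 2048 that performed best in the ablations.

```python
# Minimal sketch of structure quantization: each per-residue embedding is
# mapped to its nearest codebook entry, yielding a discrete token sequence.
# Names are illustrative assumptions, not the paper's actual API.
import numpy as np

def quantize_structure(residue_embeddings: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """residue_embeddings: (L, D) array, one local-structure embedding per residue.
    codebook:            (K, D) array of pre-trained centroids.
    Returns an (L,) array of integer structure tokens in [0, K)."""
    # Squared Euclidean distance between every embedding and every centroid.
    dists = ((residue_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy usage: 5 residues, 16-dim embeddings, vocabulary of K = 2048 tokens.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 16))
codebook = rng.normal(size=(2048, 16))
print(quantize_structure(emb, codebook))  # 5 structure-token ids
```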
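A companion sketch illustrates one way disentangled attention over paired residue/structure sequences can be scored: content-content, content-structure, and structure-content terms are summed before the softmax. DeProt's actual formulation also involves relative-position terms and may factorize the scores differently; all names here are illustrative assumptions.

```python
# Minimal sketch of disentangled attention combining residue-token and
# structure-token score terms. Single head, no position terms; illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def disentangled_attention(h_res, h_struct, Wq, Wk, Wv, Wq_s, Wk_s):
    """h_res:    (L, D) residue-token hidden states.
    h_struct: (L, D) structure-token embeddings (one per residue)."""
    L, D = h_res.shape
    q, k, v = h_res @ Wq, h_res @ Wk, h_res @ Wv       # residue (content) projections
    q_s, k_s = h_struct @ Wq_s, h_struct @ Wk_s        # structure projections
    # Sum of disentangled score terms: residue->residue, residue->structure,
    # structure->residue; the scale accounts for the three summed terms.
    scores = (q @ k.T + q @ k_s.T + q_s @ k.T) / np.sqrt(3 * D)
    return softmax(scores) @ v

rng = np.random.default_rng(1)
L, D = 6, 32
h_res, h_struct = rng.normal(size=(L, D)), rng.normal(size=(L, D))
Wq, Wk, Wv, Wq_s, Wk_s = (rng.normal(size=(D, D)) * D**-0.5 for _ in range(5))
print(disentangled_attention(h_res, h_struct, Wq, Wk, Wv, Wq_s, Wk_s).shape)  # (6, 32)
```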
DeProt is pre-trained with masked language modeling and outperforms other state-of-the-art protein language models on zero-shot mutant-effect prediction as well as on downstream tasks including thermostability prediction, metal ion binding prediction, protein localization prediction, and GO annotation prediction. Its structure quantization reduces overfitting and strengthens the integration of sequence and structure information. Ablation studies show that performance depends strongly on the structure vocabulary size and on the disentangled attention mechanism, with K = 2048 yielding the best results. Overall, DeProt's integration of structural information substantially enhances its performance on protein representation and prediction tasks.