January 15, 2024 | Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song
xTrimoPGLM is a unified 100B-scale pre-trained transformer for protein language understanding and generation. It addresses a limitation of existing protein language models by combining autoencoding and autoregressive pre-training objectives, so that a single model handles both understanding and generation tasks. The 100-billion-parameter model is trained on 940 million unique protein sequences, consuming roughly 1 trillion tokens. xTrimoPGLM outperforms other advanced baselines across 18 protein understanding benchmarks spanning four categories. It also supports an atomic-resolution view of protein structures, yielding a 3D structure prediction model that surpasses existing language-model-based tools. The model can generate de novo protein sequences that follow the statistical principles of natural ones and, after supervised fine-tuning on curated sequences, can perform programmable generation. These results highlight the capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.
The xTrimoPGLM framework combines the strengths of autoregressive blank infilling and bidirectional attention: a Masked Language Model (MLM) objective builds understanding capacity, while a General Language Model (GLM) blank-infilling objective drives generation. Trained on this large corpus of protein sequences, the model achieves low perplexity on out-of-distribution protein sequences. It also performs well in protein structure prediction, with the derived xT-Fold model reaching strong TM-scores on the CAMEO and CASP15 benchmarks. In protein understanding, xTrimoPGLM-100B surpasses previous state-of-the-art methods on 15 of 18 tasks. The model can further generate de novo protein sequences with diverse structures and can be fine-tuned for specific structural and biophysical properties. It still faces challenges in generating sequences constrained to particular properties or families, and its handling of out-of-distribution data remains limited. Despite these challenges, xTrimoPGLM shows great potential as a programmable model for exploring and synthesizing the vast protein space.
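To make the dual objective concrete, the sketch below shows, in plain Python, how a single protein sequence can be turned into an MLM "understanding" view and a GLM-style blank-infilling "generation" view. The special tokens ([MASK], [sMASK], [sop]) and the masking ratios are illustrative assumptions and do not reproduce the exact xTrimoPGLM recipe.

```python
import random

# Illustrative sketch of the two pre-training views used by a GLM-style
# protein language model. Token names and ratios are assumptions, not the
# exact xTrimoPGLM configuration.

def mlm_view(seq, mask_ratio=0.15, rng=random):
    """Bidirectional (understanding) view: mask individual residues."""
    tokens = list(seq)
    targets = {}
    for i in range(len(tokens)):
        if rng.random() < mask_ratio:
            targets[i] = tokens[i]      # residue the model must recover
            tokens[i] = "[MASK]"
    return tokens, targets

def glm_view(seq, span_ratio=0.2, rng=random):
    """Autoregressive blank-infilling (generation) view: replace one
    contiguous span with a sentinel and append its contents after a
    start-of-piece token for left-to-right prediction."""
    n = len(seq)
    span_len = max(1, int(n * span_ratio))
    start = rng.randrange(0, n - span_len + 1)
    span = seq[start:start + span_len]
    prefix = list(seq[:start]) + ["[sMASK]"] + list(seq[start + span_len:])
    # The model attends bidirectionally over `prefix` and generates the
    # masked span autoregressively after "[sop]".
    return prefix + ["[sop]"] + list(span), span

if __name__ == "__main__":
    protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
    print(mlm_view(protein))
    print(glm_view(protein))
```

In practice the two views are mixed within the same training run, which is what lets one set of weights serve both the understanding benchmarks and the generation tasks described above.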
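For reference, the TM-scores reported for xT-Fold on CAMEO and CASP15 use the standard length-normalized structural similarity measure. A minimal sketch of that formula is below, assuming the residue pairs have already been aligned; the full metric also maximizes over structural superpositions and clamps d0 for very short chains, both omitted here for brevity.

```python
def tm_score(distances, target_length):
    """Standard TM-score given distances (in Angstroms) between aligned
    residue pairs of a predicted and a reference structure.

    TM-score = (1 / L_target) * sum_i 1 / (1 + (d_i / d_0)^2),
    with d_0 = 1.24 * (L_target - 15)^(1/3) - 1.8.
    Scores range from 0 to 1; values above ~0.5 generally indicate the
    same overall fold.
    """
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```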