January 15, 2024 | Bo Chen, Xingyi Cheng, Pan Li, Yangli-ao Geng, Jing Gong, Shen Li, Zhilei Bei, Xu Tan, Boyan Wang, Xin Zeng, Chiming Liu, Aohan Zeng, Yuxiao Dong, Jie Tang, Le Song
The paper introduces xTrimoPGLM, a unified protein language model designed to address both protein understanding and protein generation tasks simultaneously. The model combines Masked Language Model (MLM) and General Language Model (GLM) objectives to strengthen its representational and generative capabilities. With 100 billion parameters and trained on 1 trillion tokens, xTrimoPGLM outperforms existing models on 18 protein understanding benchmarks across four categories: structure, interactions, functionality, and developability. It also demonstrates superior performance in 3D structural prediction, surpassing tools like ESMFold.
Additionally, xTrimoPGLM can generate de novo protein sequences with diverse structures, achieving low sequence identity but high structural resemblance to natural proteins. The model's generative capabilities are further enhanced through supervised fine-tuning and reinforcement self-training, making it a versatile tool for protein sequence design. However, the high computational cost and challenges in out-of-distribution (OOD) performance are noted as limitations. Overall, xTrimoPGLM represents a significant advancement in the field of protein science, offering new possibilities for understanding and generating protein sequences.