Training Compute-Optimal Protein Language Models


June 9, 2024 | Xingyi Cheng¹, Bo Chen¹², Pan Li¹, Jing Gong¹, Jie Tang², Le Song¹³
This paper examines compute-optimal training of protein language models (PLMs), focusing on the balance between model size, the number of training tokens, and computational budget. Using a dataset of 939 million protein sequences, the authors train more than 300 models ranging from 3.5 million to 10.7 billion parameters and derive scaling laws for the two common training objectives: the Causal Language Model (CLM) and the Masked Language Model (MLM).

The two objectives scale differently: CLM gains more from increased model size, while MLM benefits more from additional training data. The authors also show that models pre-trained with CLM transfer effectively to MLM, and they characterize the trade-off in allocating training tokens between the two objectives to optimize overall performance. The scaling laws are further validated by comparison with large-scale versions of ESM-2 and PROGEN2, where the proposed scaling strategies lead to improved performance across a range of downstream protein-related tasks.

The study highlights the importance of large, diverse datasets for avoiding overfitting and loss plateaus in protein language modeling, and emphasizes the need to allocate computational resources optimally for efficient and effective PLM training. Together, the findings offer a framework for future research and development in this area.
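The summary refers to scaling laws and compute-optimal allocation without reproducing the fitted formulas. The snippet below is a minimal sketch, assuming a Chinchilla-style parameterization with the common approximation C ≈ 6·N·D and power-law optima N_opt ∝ C^a, D_opt ∝ C^b. The function names and the exponent values (a = b = 0.5) are illustrative placeholders, not the CLM/MLM coefficients fitted in the paper.

```python
# Sketch of Chinchilla-style compute-optimal allocation (not the authors' code).
# Exponents a, b are placeholder assumptions; substitute fitted CLM/MLM values
# from an IsoFLOP-style analysis to reproduce a specific allocation.

import numpy as np


def optimal_allocation(compute_flops, a=0.5, b=0.5):
    """Split a FLOP budget C into parameters N and training tokens D.

    Assumes C ~= 6 * N * D and power-law optima N_opt ∝ C**a, D_opt ∝ C**b
    with a + b = 1.
    """
    n_opt = (compute_flops / 6.0) ** a
    d_opt = (compute_flops / 6.0) ** b
    return n_opt, d_opt


def fit_exponent(compute_budgets, best_model_sizes):
    """Fit the exponent a in N_opt ∝ C**a by linear regression in log space,
    given compute budgets and the empirically best model size at each budget."""
    slope, intercept = np.polyfit(np.log(compute_budgets),
                                  np.log(best_model_sizes), 1)
    return slope, np.exp(intercept)


if __name__ == "__main__":
    # Example: allocate a 1e21 FLOP budget under the placeholder exponents.
    n, d = optimal_allocation(1e21)
    print(f"~{n:.3e} parameters, ~{d:.3e} tokens")
```

Under these placeholder exponents the budget splits evenly between parameters and tokens; different fitted exponents for CLM and MLM would shift the split toward larger models or more data, respectively.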