June 9, 2024 | Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song
The paper explores compute-optimal training of protein language models, focusing on best practices for efficient compute allocation. The authors investigate scaling laws for Masked Language Models (MLM) and Causal Language Models (CLM) built on Transformer architectures, tailored to the characteristics of protein sequence data. They use a dataset of 939 million protein sequences and train over 300 models with parameter counts ranging from 3.5 million to 10.7 billion. Key findings include:
1. **Diminishing Returns and Overfitting**: The study observes diminishing returns for CLM and overfitting for MLM when training on the UniRef database alone. To address this, the authors add metagenomic protein sequences to increase data diversity and avoid overfitting.
2. **Scaling Laws**: They derive separate scaling laws for MLM and CLM, showing that the optimal amount of training data grows sublinearly with model size and that each objective follows its own power law: MLM scales with a compute exponent of approximately 0.77, while CLM follows a distinct exponent (a minimal power-law fitting sketch follows this list).
3. **Transfer Scaling**: Models pre-trained with CLM can be transferred to MLM, with the optimal allocation of training tokens between the two objectives determined by the scaling laws together with the Effectively Transferred Tokens (D_t), which quantify how much from-scratch MLM training the CLM stage is worth (see the allocation sketch at the end of this summary).
4. **Experimental Validation**: The scaling laws are validated by training large-scale models under the derived recipes and comparing them with ESM-2 and PROGEN2 on downstream tasks, including protein generation, structure prediction, and function-related tasks, within similar or reduced pre-training compute budgets.
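As a concrete illustration of point 2, the sketch below fits a power law of the form N_opt = k · C^a by log-log linear regression. The data points, coefficient, and noise level are illustrative assumptions for this summary, not measurements from the paper; only the ~0.77 MLM exponent is taken from the finding above.

```python
# A minimal sketch (not the paper's code): fitting a compute-optimal scaling
# law of the form N_opt = k * C**a via log-log linear regression. The points
# below are synthetic, generated from an assumed exponent of 0.77; the paper
# fits such laws to actual training runs.
import numpy as np

def fit_power_law(compute: np.ndarray, optimal_size: np.ndarray) -> tuple[float, float]:
    """Fit log N = log k + a * log C and return (k, a)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(optimal_size), deg=1)
    return float(np.exp(intercept)), float(slope)

# Synthetic (compute budget, compute-optimal model size) pairs with mild noise.
rng = np.random.default_rng(0)
C = np.logspace(18, 22, num=8)                              # FLOPs budgets
N = 0.3 * C**0.77 * rng.lognormal(0.0, 0.05, size=C.size)   # assumed exponent 0.77

k, a = fit_power_law(C, N)
print(f"fitted exponent a = {a:.3f}  (expected ~0.77)")
```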
The paper also discusses the importance of data quality and quantity, the impact of different training objectives, and the sensitivity of hyperparameters. The findings provide insights into optimizing the training of protein language models and can be applied to other biological data modalities.
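To make the transfer-scaling idea in point 3 concrete, here is a small sketch of splitting a fixed token budget between CLM pre-training and a subsequent MLM stage. It assumes a power-law form D_t = k · D_clm^α · N^β for the Effectively Transferred Tokens, in the spirit of the transfer scaling-law literature; the functional form and all constants below are illustrative assumptions, not the paper's fitted values.

```python
# A minimal sketch, assuming D_t = k * D_clm**alpha * N**beta for the
# Effectively Transferred Tokens; k, alpha, and beta are placeholders,
# not constants fitted in the paper.

def effectively_transferred_tokens(d_clm: float, n_params: float,
                                   k: float = 3.5, alpha: float = 0.7,
                                   beta: float = 0.3) -> float:
    """From-scratch MLM tokens that d_clm tokens of CLM pre-training are 'worth'."""
    return k * d_clm**alpha * n_params**beta

def best_token_split(total_tokens: float, n_params: float,
                     steps: int = 1_000) -> tuple[float, float]:
    """Grid-search the CLM/MLM token split that maximises effective MLM data."""
    best_split, best_value = (0.0, total_tokens), -1.0
    for i in range(steps + 1):
        d_clm = total_tokens * i / steps
        d_mlm = total_tokens - d_clm
        value = d_mlm + effectively_transferred_tokens(d_clm, n_params)
        if value > best_value:
            best_split, best_value = (d_clm, d_mlm), value
    return best_split

if __name__ == "__main__":
    d_clm, d_mlm = best_token_split(total_tokens=1e11, n_params=1e9)
    print(f"CLM tokens: {d_clm:.2e}, MLM tokens: {d_mlm:.2e}")
```

Under these placeholder constants the search lands on a nonzero CLM share; that outcome is only meant to illustrate the trade-off the paper formalizes, not to reproduce its recommended allocation.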