January 14, 2025 | Sam Gelman, Bryce Johnson, Chase Freschlin, Arnav Sharma, Sameer D'Costa, John Peters, Anthony Gitter, and Philip A. Romero
This paper introduces Mutational Effect Transfer Learning (METL), a biophysics-based protein language model framework that integrates advanced machine learning with biophysical modeling. METL pretrains transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. It is then fine-tuned on experimental sequence-function data to harness biophysical signals for predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks such as generalizing from small training sets and extrapolating to mutations not observed in the training data. The study demonstrates METL's ability to design functional green fluorescent protein (GFP) variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
The METL framework involves three steps: synthetic data generation, synthetic data pretraining, and experimental data fine-tuning. Synthetic data is generated with molecular modeling, and a transformer-based protein language model (PLM) is pretrained on this data to capture biophysical knowledge. The model is then fine-tuned on experimental sequence-function data, producing biophysics-aware models that predict specific protein properties. METL-Local and METL-Global are two pretraining strategies that specialize at different scales of protein sequence space: METL-Local learns a representation targeted to a specific protein of interest, while METL-Global learns a general representation applicable to any protein.
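To make the pretrain-then-fine-tune pattern concrete, here is a minimal PyTorch sketch. The architecture sizes, the number of biophysical attributes (55 here), and the random placeholder data are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a METL-style pretrain/fine-tune loop (assumed details).
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_IDX = {a: i for i, a in enumerate(AA)}

def encode(seqs, max_len):
    """Integer-encode amino acid sequences, padding with a reserved index."""
    x = torch.full((len(seqs), max_len), len(AA), dtype=torch.long)
    for i, s in enumerate(seqs):
        x[i, : len(s)] = torch.tensor([AA_IDX[a] for a in s])
    return x

class SequenceEncoder(nn.Module):
    """Tiny transformer encoder standing in for the METL backbone."""
    def __init__(self, d_model=64, n_layers=2, n_heads=4, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(len(AA) + 1, d_model)  # +1 for padding token
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):
        h = self.embed(x) + self.pos(torch.arange(x.size(1), device=x.device))
        return self.encoder(h).mean(dim=1)  # mean-pool to one vector per sequence

# Step 1 stand-in: random targets in place of molecular-modeling scores.
x = encode(["MKV" * 10] * 8, max_len=128)   # toy batch of variant sequences
y_sim = torch.randn(8, 55)                  # 55 = assumed number of biophysical attributes

# Step 2: pretrain encoder + multi-attribute regression head on synthetic data.
encoder, pre_head = SequenceEncoder(), nn.Linear(64, 55)
opt = torch.optim.Adam([*encoder.parameters(), *pre_head.parameters()])
opt.zero_grad()
nn.functional.mse_loss(pre_head(encoder(x)), y_sim).backward()
opt.step()

# Step 3: fine-tune on experimental data with a fresh scalar output head.
fit_head = nn.Linear(64, 1)
opt = torch.optim.Adam([*encoder.parameters(), *fit_head.parameters()], lr=1e-4)
y_exp = torch.randn(8, 1)                   # placeholder functional scores
opt.zero_grad()
nn.functional.mse_loss(fit_head(encoder(x)), y_exp).backward()
opt.step()
```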
The study evaluates METL's predictive generalization on 11 experimental datasets representing proteins of varying sizes, folds, and functions. METL outperforms existing methods on small training sets and is particularly effective at extrapolation tasks, where models must predict the effects of mutations, sequence positions, functional regimes, or score ranges not represented in the training data. METL-Local and Linear-EVE perform well on small training sets, while METL-Global and ESM-2 remain competitive on mid-sized training sets. METL's ability to incorporate biophysical knowledge and its strong extrapolation performance highlight its potential for protein engineering.
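As one example of how such an evaluation can be set up, the sketch below builds a position-extrapolation split: training variants mutate only a held-in set of positions, while test variants mutate only held-out positions. The variant-string notation (e.g., "A23G,L45P") and helper names are hypothetical, not the paper's exact protocol.

```python
# Sketch of a position-extrapolation split (assumed variant notation).
import random

def position_split(variants, seq_len, train_frac=0.8, seed=0):
    """Split variants so train and test mutations occur at disjoint positions."""
    rng = random.Random(seed)
    positions = list(range(seq_len))
    rng.shuffle(positions)
    train_pos = set(positions[: int(train_frac * seq_len)])

    def mutated_positions(v):
        # "A23G,L45P" -> {23, 45}: strip the flanking amino acid letters
        return {int(m[1:-1]) for m in v.split(",")}

    train, test = [], []
    for v in variants:
        pos = mutated_positions(v)
        if pos <= train_pos:
            train.append(v)
        elif pos.isdisjoint(train_pos):
            test.append(v)
        # variants spanning both position sets are discarded to keep the split strict
    return train, test

train, test = position_split(["A23G", "L45P", "A23G,L45P"], seq_len=100)
```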
The study also explores the relative information value of simulated versus experimental data, showing that simulated data can partially compensate for a lack of experimental data. Pretraining on simulated data improves METL's performance, especially for larger proteins. Function-specific simulations further improve both the initial pretrained METL model and its performance after fine-tuning, as demonstrated by gains when pretraining on GB1-IgG binding data. This ability to incorporate function-specific molecular modeling and simulation is a key strength of the framework.
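One way to quantify this trade-off is a learning curve: fine-tune the same architecture from scratch and from a pretrained checkpoint at several experimental training-set sizes, then compare held-out Spearman correlation. The `train_model` callable and `predict` method below are hypothetical stand-ins, not functions from the METL codebase.

```python
# Sketch of a learning-curve comparison (hypothetical train_model interface).
from scipy.stats import spearmanr

def learning_curve(train_model, X, y, X_test, y_test, sizes=(64, 256, 1024)):
    curve = {}
    for n in sizes:
        model = train_model(X[:n], y[:n])            # fit on n labeled examples
        rho, _ = spearmanr(model.predict(X_test), y_test)
        curve[n] = rho                               # rank correlation on held-out data
    return curve

# curve_scratch    = learning_curve(train_from_scratch, X, y, X_test, y_test)
# curve_pretrained = learning_curve(train_from_pretrained, X, y, X_test, y_test)
# The gap between the curves at small n reflects how much simulated
# pretraining data compensates for missing experimental data.
```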
The study also shows that METL can design functional GFP variants when trained on only 64 examples, highlighting its potential for protein engineering. METL's success in the Unobserved AA design setting, where the model must infer the effects of mutations it has not observed, is particularly remarkable. The study also shows that METL's biophysical prior can indirectly improve designs.
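A design loop in this spirit can be written as simulated annealing over the fine-tuned model's predictions: propose point substitutions, always keep improvements, and occasionally accept worse moves early on. The `score` function, cooling schedule, and move set below are illustrative assumptions, not the paper's exact design procedure.

```python
# Sketch of simulated-annealing sequence design over a trained model's scores.
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def design(seq, score, n_steps=1000, t0=1.0, seed=0):
    rng = random.Random(seed)
    current, current_s = list(seq), score(seq)
    best, best_s = current, current_s
    for step in range(n_steps):
        t = t0 * (1 - step / n_steps)        # linear cooling schedule
        cand = current[:]
        cand[rng.randrange(len(cand))] = rng.choice(AA)  # random point substitution
        cand_s = score("".join(cand))
        # Accept improvements always; worse moves with Boltzmann probability.
        accept = cand_s >= current_s or rng.random() < math.exp(
            (cand_s - current_s) / max(t, 1e-6)
        )
        if accept:
            current, current_s = cand, cand_s
            if current_s > best_s:
                best, best_s = current, current_s
    return "".join(best), best_s

# Toy usage with a stand-in scoring function (a real run would call the
# fine-tuned model's predicted fluorescence instead):
designed, s = design("MSKGEELFTG", score=lambda seq: seq.count("E"))
```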