January 14, 2025 | Sam Gelman, Bryce Johnson, Chase Freschlin, Arnav Sharma, Sameer D’Costa, John Peters, Anthony Gitter, Philip A. Romero
The paper introduces Mutational Effect Transfer Learning (METL), a framework that combines advanced machine learning and biophysical modeling to enhance protein engineering. METL pretrains transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. The pre-trained models are then fine-tuned on experimental sequence-function data to predict protein properties such as thermostability, catalytic activity, and fluorescence.
METL excels in challenging tasks like generalizing from small training sets and extrapolating to mutations not observed in the training data. The authors demonstrate METL's ability to design functional green fluorescent protein (GFP) variants with only 64 sequence-function examples, showcasing its potential for incorporating biophysical knowledge into protein language models. The framework is evaluated on various datasets and compared with existing methods, highlighting its strengths in specific protein engineering tasks. The paper also discusses the relative information value of simulated versus experimental data and the importance of function-specific simulations in improving METL's performance. Overall, METL represents a significant step toward effectively integrating biophysical insights with machine learning-based protein fitness prediction.
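The pretrain-then-fine-tune idea described above can be illustrated with a toy sketch. This is not the METL implementation: it assumes a linear model on one-hot features in place of a transformer, synthetic scores in place of biophysical simulations and experimental assays, and ridge regression shrunk toward the pretrained weights in place of gradient-based fine-tuning. All names (`one_hot`, `ridge_fit`, `w_sim`, and so on) are hypothetical and chosen for illustration only.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten a protein sequence into a one-hot vector of length 20 * len(seq)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for pos, aa in enumerate(seq):
        x[pos, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()


def ridge_fit(X, y, lam, w_prior=None):
    """Closed-form solution of min_w ||Xw - y||^2 + lam * ||w - w_prior||^2.

    With w_prior=None this is ordinary ridge regression (training from scratch).
    With w_prior set to pretrained weights, the fit stays close to them unless
    the data say otherwise, which is a simple stand-in for fine-tuning.
    """
    d = X.shape[1]
    if w_prior is None:
        w_prior = np.zeros(d)
    A = X.T @ X + lam * np.eye(d)
    b = X.T @ y + lam * w_prior
    return np.linalg.solve(A, b)


def random_sequences(rng, n, length):
    """Draw n random sequences of the given length (synthetic data)."""
    letters = list(AMINO_ACIDS)
    return ["".join(rng.choice(letters, size=length)) for _ in range(n)]


def featurize(seqs):
    return np.stack([one_hot(s) for s in seqs])


rng = np.random.default_rng(0)
seq_len = 10
dim = seq_len * len(AMINO_ACIDS)

# Assumption: the "simulated" score and the "experimental" property share most
# of their signal, and the experiment adds a smaller function-specific term.
w_shared = rng.normal(size=dim)
w_exp_true = w_shared + 0.3 * rng.normal(size=dim)

# Stage 1: pretrain on plentiful simulated data.
X_sim = featurize(random_sequences(rng, 2000, seq_len))
y_sim = X_sim @ w_shared + 0.1 * rng.normal(size=len(X_sim))
w_sim = ridge_fit(X_sim, y_sim, lam=1.0)

# Stage 2: fine-tune on only 64 experimental examples.
X_exp = featurize(random_sequences(rng, 64, seq_len))
y_exp = X_exp @ w_exp_true + 0.1 * rng.normal(size=len(X_exp))
w_transfer = ridge_fit(X_exp, y_exp, lam=1.0, w_prior=w_sim)
w_scratch = ridge_fit(X_exp, y_exp, lam=1.0)

# Compare both models on held-out sequences.
X_test = featurize(random_sequences(rng, 500, seq_len))
y_test = X_test @ w_exp_true
mse_transfer = float(np.mean((X_test @ w_transfer - y_test) ** 2))
mse_scratch = float(np.mean((X_test @ w_scratch - y_test) ** 2))
print(f"held-out MSE, from scratch: {mse_scratch:.3f}")
print(f"held-out MSE, transfer:     {mse_transfer:.3f}")
```

In this synthetic setting the transfer model benefits only because the simulated and experimental targets were constructed to share signal. The example says nothing about how much real biophysical simulations help on real assays, which is the question the paper evaluates.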