Protein language models learn evolutionary statistics of interacting sequence motifs


October 28, 2024 | Zhidian Zhang, Hannah K. Wayment-Steele, Garyk Brix, Haobo Wang, Dorothee Kern, and Sergey Ovchinnikov
Protein language models (pLMs) have shown significant promise in predicting and designing protein structures, but it remains unclear how well they capture the biophysics of protein folding. This study investigates how the pLM ESM-2 predicts protein contacts and stores evolutionary information. The researchers developed an unsupervised method for evaluating pLMs that compares their coevolutionary statistics to those of linear models, and found that ESM-2 stores statistics of coevolving residues much like simpler approaches such as Markov Random Fields and Multivariate Gaussian models.

ESM-2 does not require the full sequence context to predict interresidue contacts: predicted contacts can be recovered from local sequence windows alone, indicating that the model stores a small coevolutionary model for each pair of interacting fragments. In masking experiments, unmasking the regions flanking a contact recovered it significantly more effectively than unmasking random positions, suggesting that pLMs predict structures by looking up pairings of sequence segments. Together, the results indicate that pLMs learn statistics of sequence motifs and their relative separation, much like prior coevolution-based approaches to predicting and designing protein structures.

The study also exposes limitations of current pLMs: they may not fully capture the biophysics of protein folding, as they can predict unrealistic structures for protein isoforms. Understanding these underlying mechanisms is essential for improving the reliability of pLMs in protein structure prediction.
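To make the linear baselines concrete, here is a minimal numpy sketch (not the authors' code) of the Multivariate Gaussian coevolution model mentioned above: alignment columns are one-hot encoded, and the strength of coupling between two positions is read off the regularized inverse covariance (precision) matrix. The toy alphabet, shrinkage value, and function names are illustrative assumptions.

```python
import numpy as np

# Toy alphabet: 20 amino acids plus a gap character (an assumption for
# illustration; real pipelines use alignment-specific encodings).
AA = "ACDEFGHIKLMNPQRSTVWY-"

def one_hot(msa):
    """Encode an MSA (list of equal-length strings) as an (n_seqs, L*q) matrix."""
    idx = {a: i for i, a in enumerate(AA)}
    q = len(AA)
    n, L = len(msa), len(msa[0])
    X = np.zeros((n, L * q))
    for s, seq in enumerate(msa):
        for p, a in enumerate(seq):
            X[s, p * q + idx[a]] = 1.0
    return X

def coevolution_scores(msa, shrink=0.5):
    """L x L coupling scores from a shrinkage-regularized precision matrix."""
    X = one_hot(msa)
    q = len(AA)
    L = X.shape[1] // q
    C = np.cov(X, rowvar=False)
    # Shrinkage toward the identity keeps the covariance invertible
    # even for constant or perfectly correlated columns.
    C = (1 - shrink) * C + shrink * np.eye(C.shape[0])
    P = np.linalg.inv(C)
    # Frobenius norm of each q x q inter-position coupling block.
    S = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            if i != j:
                S[i, j] = np.linalg.norm(P[i*q:(i+1)*q, j*q:(j+1)*q])
    return S

# Usage: positions 0 and 2 covary perfectly (A<->C, D<->E), while
# position 3 varies independently, so the (0, 2) coupling dominates.
S = coevolution_scores(["ARCA", "DREA", "ARCK", "DREK"] * 5)
```

In this toy alignment the precision matrix assigns a strong coupling to the covarying pair (0, 2) and a near-zero coupling to the independent pair (0, 3), which is the kind of pairwise statistic the paper compares against ESM-2's stored coevolutionary information.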
While pLMs have not yet reached the ability to directly model the physics of protein folding, this research and other interpretability studies provide insights into how deep learning can approximate the fundamentals of biophysics. The findings highlight the need for further improvements in pLMs, particularly in accurately predicting evolutionary effects such as multiple stable conformations.