Protein language models learn evolutionary statistics of interacting sequence motifs


October 28, 2024 | Zhidian Zhang, Hannah K. Wayment-Steele, Garyk Brix, Haobo Wang, Dorothee Kern, and Sergey Ovchinnikov
Protein language models (pLMs) have shown significant promise in predicting and designing protein structures, but it remains unclear how well they capture the biophysics of protein folding. This study investigates how the pLM ESM-2 predicts protein contacts and stores evolutionary information. The researchers developed an unsupervised method for evaluating pLMs that compares their coevolutionary statistics to those of linear models, and found that ESM-2 stores statistics of coevolving residues much like simpler approaches such as Markov Random Fields and Multivariate Gaussian models.

ESM-2 does not require the full sequence context to predict interresidue contacts: predicted contacts can be recovered from local sequence windows alone, indicating that the model stores a small coevolutionary model for each pair of interacting fragments. In masking experiments, unmasking the regions flanking a contact recovered it significantly more effectively than unmasking random positions, suggesting that pLMs predict structures by looking up pairings of sequence segments. Together, the results indicate that pLMs learn statistics of sequence motifs and their relative separation, much like prior coevolution-based approaches to predicting and designing protein structures.

The study also exposes limitations of current pLMs: they may not fully capture the biophysics of protein folding, as they can predict unrealistic structures for protein isoforms. Understanding these underlying mechanisms is essential for improving the reliability of pLMs in protein structure prediction.
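To make the linear baselines concrete, here is a minimal numpy sketch (not the authors' code) of the Multivariate Gaussian coevolution model mentioned above: alignment columns are one-hot encoded, and the strength of coupling between two positions is read off the regularized inverse covariance (precision) matrix. The toy alphabet, shrinkage value, and function names are illustrative assumptions.

```python
import numpy as np

# Toy alphabet: 20 amino acids plus a gap character (an assumption for
# illustration; real pipelines use alignment-specific encodings).
AA = "ACDEFGHIKLMNPQRSTVWY-"

def one_hot(msa):
    """Encode an MSA (list of equal-length strings) as an (n_seqs, L*q) matrix."""
    idx = {a: i for i, a in enumerate(AA)}
    q = len(AA)
    n, L = len(msa), len(msa[0])
    X = np.zeros((n, L * q))
    for s, seq in enumerate(msa):
        for p, a in enumerate(seq):
            X[s, p * q + idx[a]] = 1.0
    return X

def coevolution_scores(msa, shrink=0.5):
    """L x L coupling scores from a shrinkage-regularized precision matrix."""
    X = one_hot(msa)
    q = len(AA)
    L = X.shape[1] // q
    C = np.cov(X, rowvar=False)
    # Shrinkage toward the identity keeps the covariance invertible
    # even for constant or perfectly correlated columns.
    C = (1 - shrink) * C + shrink * np.eye(C.shape[0])
    P = np.linalg.inv(C)
    # Frobenius norm of each q x q inter-position coupling block.
    S = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            if i != j:
                S[i, j] = np.linalg.norm(P[i*q:(i+1)*q, j*q:(j+1)*q])
    return S

# Usage: positions 0 and 2 covary perfectly (A<->C, D<->E), while
# position 3 varies independently, so the (0, 2) coupling dominates.
S = coevolution_scores(["ARCA", "DREA", "ARCK", "DREK"] * 5)
```

In this toy alignment the precision matrix assigns a strong coupling to the covarying pair (0, 2) and a near-zero coupling to the independent pair (0, 3), which is the kind of pairwise statistic the paper compares against ESM-2's stored coevolutionary information.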
While pLMs have not yet reached the ability to directly model the physics of protein folding, this research and other interpretability studies provide insights into how deep learning can approximate the fundamentals of biophysics. The findings highlight the need for further improvements in pLMs, particularly in accurately predicting evolutionary effects such as multiple stable conformations.