Evolutionary-scale prediction of atomic level protein structure with a language model

December 21, 2022 | Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuijl, Ori Kabeli, Yaniv Shmueli, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Salvatore Candido, Alexander Rives
A language model has been developed to predict the atomic-level structure of proteins at evolutionary scale, achieving up to a 60x speed improvement over state-of-the-art methods while maintaining high accuracy. This model, ESM-2, is trained on millions of protein sequences and can predict the three-dimensional structure of proteins directly from their amino acid sequences. Because it learns evolutionary patterns across millions of sequences, it can predict protein structures at the resolution of individual atoms. This approach eliminates the need for multiple sequence alignments and other computationally intensive steps, significantly speeding up prediction. The ESM Metagenomic Atlas, a large-scale structural characterization of metagenomic proteins, has been created and contains more than 225 million high-confidence predicted structures. These predictions include many novel structures not previously observed in experimental data, providing an unprecedented view into the vast diversity of metagenomic proteins. The atlas is available as an open science resource, allowing researchers to access and analyze the predicted structures.

The model's performance is closely linked to its understanding of protein sequences, as measured by perplexity. Lower perplexity indicates a better understanding of the sequence, which in turn leads to more accurate structure predictions. The model's ability to predict structures at the atomic level has been validated on various test sets, showing strong correlations with experimental data. The ESMFold model, developed from ESM-2, enables fast and accurate single-sequence structure prediction. It is considerably faster than alignment-based methods and achieves high accuracy, with confidence scores that are well calibrated against the actual accuracy of the predictions.
The model's predictions are available in the ESM Metagenomic Atlas, providing a valuable resource for researchers studying metagenomic proteins. The results demonstrate that language models can effectively capture evolutionary patterns in protein sequences, leading to accurate predictions of protein structures at the atomic level. This advancement has the potential to significantly accelerate progress in understanding the structure of proteins, particularly in the context of metagenomic sequencing, where the diversity of proteins is vast and largely unexplored. The findings highlight the importance of language models in capturing deep biological information from evolutionary patterns, offering new insights into protein structure and function.
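
Since perplexity is the measure that ties the language model's understanding of a sequence to structure accuracy, a small worked example may help. The sketch below is illustrative only and is not the ESM-2 implementation: it assumes we already have, for each position, the probability the model assigned to the true amino acid, and the function name `perplexity` is hypothetical. Perplexity is the exponential of the average negative log-likelihood, so a model guessing uniformly among 20 amino acids scores about 20, while a model that assigns high probability to the correct residues approaches 1.

```python
import math


def perplexity(true_token_probs):
    """Perplexity from the probabilities assigned to the correct tokens.

    true_token_probs: list of floats in (0, 1], one per scored position,
    each being the model's probability for the true amino acid there.
    (Hypothetical helper for illustration; not part of any ESM library.)
    """
    if not true_token_probs:
        raise ValueError("need at least one probability")
    # Average negative log-likelihood over the scored positions.
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    # Perplexity is the exponential of that average.
    return math.exp(mean_nll)


# A model guessing uniformly among the 20 standard amino acids.
uniform_guess = perplexity([1 / 20] * 50)

# A model that is confident and correct at every position.
confident = perplexity([0.9] * 50)

# Lower perplexity corresponds to better sequence understanding.
print(f"uniform guess: {uniform_guess:.2f}")
print(f"confident model: {confident:.2f}")
```

Under this definition, the relationship runs in one direction: the better the model predicts the residues of a sequence, the lower its perplexity on that sequence.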