VOL. 14, NO. 8, AUGUST 2021 | Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, Ghalia Rehawi, Yu Wang, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Martin Steinegger, Debsindhu Bhowmik and Burkhard Rost
The paper "ProtTrans: Towards Cracking the Language of Life’s Code Through Self-Supervised Learning" by Ahmed Elnaggar et al. explores the use of large-scale language models (LMs) trained on protein sequences to predict protein properties and structure. The authors trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on extensive protein datasets, including UniRef and BFD, which contain up to 393 billion amino acids. The models were trained on the Summit supercomputer using 5616 GPUs and on TPU Pods with up to 1024 cores. The raw protein LM embeddings, learned from unlabeled data alone, captured biophysical features of protein sequences such as charge, polarity, size, and hydrophobicity. The embeddings were validated on several tasks, including per-residue prediction of protein secondary structure (3-state accuracy Q3=81%-87%), per-protein prediction of sub-cellular localization (ten-state accuracy Q10=81%), and classification of membrane vs. water-soluble proteins (2-state accuracy Q2=91%). The results showed that the most informative embeddings (ProtT5) outperformed state-of-the-art methods without using evolutionary information, demonstrating that protein LMs learned some of the grammar of the language of life. The authors released their models at https://github.com/agemagician/ProtTrans to facilitate future research.
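As a minimal sketch of how such per-residue embeddings might be obtained in practice, the snippet below loads a ProtTrans checkpoint through the Hugging Face `transformers` library and encodes a single protein sequence. The checkpoint name `Rostlab/prot_t5_xl_uniref50`, the space-separated input format, and the mapping of rare residues to X follow the conventions documented in the public ProtTrans repository; they are assumptions for illustration, not details stated in the paper summary above.

```python
# Sketch: extracting per-residue ProtT5 embeddings with Hugging Face transformers.
# Checkpoint name and preprocessing are assumptions based on the ProtTrans repo.
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

model_name = "Rostlab/prot_t5_xl_uniref50"  # assumed checkpoint name
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# ProtT5 expects space-separated residues; rare amino acids are mapped to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding vector per residue; the final position corresponds to the EOS token.
residue_embeddings = outputs.last_hidden_state[0, : len(sequence)]
print(residue_embeddings.shape)  # e.g. torch.Size([33, 1024])
```

Per-residue vectors like these feed the secondary-structure predictions reported above, while averaging them over the sequence yields a per-protein representation for tasks such as sub-cellular localization.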