Exploring structural diversity across the protein universe with The Encyclopedia of Domains

March 19, 2024 | Lau, A. M.¹⁺, Bordin, N.²⁺, Kandathil, S. M.¹, Sillitoe, I.², Waman, V. P.², Wells, J.²³, Orengo, C.² and Jones, D. T.¹²
The Encyclopedia of Domains (TED) is a comprehensive resource that identifies and classifies protein domains across the AlphaFold Protein Structure Database (AFDB), which contains predicted structures for over 214 million UniProt sequences. TED combines deep learning-based domain parsing with structure comparison algorithms to segment and classify domains, revealing over 370 million domains, more than sequence-based methods detect. It expands the known set of protein structural domains by identifying over 10,000 previously unseen structural interactions between superfamilies and by uncovering thousands of new architectures and folds across protein fold space. In doing so, TED provides a functional interface to the AFDB that enables a wide range of downstream analyses.
Domains are identified with three automated parsing methods (Merizo, Chainsaw, and UniDoc) and classified with two structural comparison methods (Foldseek and Foldclass-search), allowing over 251 million domains to be placed in the CATH hierarchy. TED reveals 7,427 putative novel architectures and folds. These include high-symmetry domains such as an 11-bladed beta-propeller and an 11-helix propeller, neither of which has been observed before. TED also identifies novel domains across the Tree of Life, including a curious archaeal domain found as a sequence singleton and a novel domain found only in eukaryotes.
TED further identifies interactions between domain pairs, many of which are unique to TED. It contains 27,280,057 instances of interacting domains across 13,771 Interacting Superfamily Pairs (ISPs), compared to 196,234 instances across 5,111 ISPs in CATH. TED essentially doubles the set of known domain interactions at the superfamily level, providing a starting point for investigating novel, functionally important interactions between domain families.
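The difference between interaction instances and ISPs can be made concrete with a small Python sketch. It assumes a hypothetical input of one superfamily-ID pair per observed domain-domain contact; this is not TED's file format, and the pairings in the toy data are invented purely for illustration.

```python
from collections import Counter


def count_isps(interactions):
    """Count interaction instances per unordered superfamily pair.

    `interactions` is an iterable of (superfamily_a, superfamily_b) tuples,
    one per observed pair of interacting domains (hypothetical input format).
    """
    # Sort each pair so that (A, B) and (B, A) collapse to the same ISP key.
    return Counter(tuple(sorted(pair)) for pair in interactions)


# Toy data: CATH-style superfamily IDs with invented pairings.
example = [
    ("3.40.50.300", "1.10.8.10"),
    ("1.10.8.10", "3.40.50.300"),    # same ISP as above, reversed order
    ("3.40.50.300", "3.40.50.300"),  # a superfamily interacting with itself
    ("2.60.40.10", "3.40.50.300"),
]

isps = count_isps(example)
n_instances = sum(isps.values())  # total interacting domain pairs
n_isps = len(isps)                # distinct Interacting Superfamily Pairs
print(f"{n_instances} instances across {n_isps} ISPs")
```

Treating pairs as unordered is a design assumption of this sketch; under it, many instances can map to a single ISP, which is why the instance count is far larger than the ISP count in the figures quoted above.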
TED also addresses the issue of redundant sequences in the AFDB, finding that nearly 39 million structures are exact sequence duplicates of other proteins in the database. It identifies approximately 69 million domains across this TED-redundant set, highlighting the structural variation within sequence-redundant clusters. A detailed analysis of domain-level deviations across models of identical sequences identifies cases where the consensus domain differs dramatically between proteins that share the same sequence.
As a functional interface to the AFDB, TED is an ongoing development that evolves with the needs of its users, aiming to provide the community with the most comprehensive summary and breakdown of the structures within the AFDB. It is expected to serve as a starting point for a whole host of analyses, including as a comprehensive dataset for training and testing a new generation of deep learning-based applications.
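The exact-duplicate analysis mentioned above amounts to grouping database entries by identical sequence. The following is a minimal Python sketch of that idea, assuming a hypothetical mapping from entry ID to amino-acid sequence; the hashing scheme, function name, and toy sequences are illustrative choices, not a description of how TED itself detects duplicates.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(sequences):
    """Return lists of entry IDs that share an identical sequence.

    `sequences` maps an entry ID to its amino-acid sequence
    (hypothetical input format, not an AFDB or TED file layout).
    """
    groups = defaultdict(list)
    for entry_id, seq in sequences.items():
        # Hash the sequence so very long proteins are compared by a short key.
        digest = hashlib.sha256(seq.encode("ascii")).hexdigest()
        groups[digest].append(entry_id)
    # Keep only clusters that contain more than one entry.
    return [ids for ids in groups.values() if len(ids) > 1]


# Toy data with made-up IDs and sequences.
toy = {
    "P1": "MKTAYIAK",
    "P2": "MKTAYIAK",
    "P3": "MSLLTEVE",
    "P4": "MKTAYIAK",
}

clusters = group_exact_duplicates(toy)
# Entries beyond the first in each cluster are duplicates of another entry.
n_redundant = sum(len(ids) - 1 for ids in clusters)
print(clusters, n_redundant)
```

Once entries are clustered this way, the domain assignments of models within one cluster can be compared directly, which is the kind of domain-level deviation analysis the text describes.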