CDD/SPARCLE: the conserved domain database in 2020

2020 | Shennan Lu, Jiyao Wang, Farideh Chitsaz, Myra K. Derbyshire, Renata C. Geer, Noreen R. Gonzales, Marc Gwadz, David I. Hurwitz, Gabriele H. Marchler, James S. Song, Narmada Thanki, Roxanne A. Yamashita, Mingzhang Yang, Dachuan Zhang, Chanjuan Zheng, Christopher J. Lanczycki and Aron Marchler-Bauer
The Conserved Domain Database (CDD) is a public resource that provides domain annotations for proteins and nucleotides. It offers both pre-computed domain annotations and live search services for single protein or nucleotide queries and for larger sets of protein sequences. CDD curators continue to develop hierarchical classifications of protein domain families and to record conserved sites associated with molecular function. These annotations support hypothesis-driven biomolecular research. CDD also provides a significant corpus of curated domain architectures to support the naming of bacterial proteins in RefSeq. These architecture definitions are available via SPARCLE, the Subfamily Protein Architecture Labeling Engine. CDD version 3.17 includes 52,910 protein and protein-domain models from various sources, including Pfam, SMART, COGs, TIGRFAMs, NCBI Protein Clusters, NCBIfam, and CDD's in-house data curation. CDD v3.18 will include Pfam version 32 and 55,434 protein and protein-domain models. The domain model database size has been increased to match the size of the current model collection, so RPS-BLAST reports marginally higher E-values. The NCBIfam collection in CDD is a set of models derived from HMMs developed to improve bacterial genome annotation. Because of their narrow scope, CDD currently excludes NCBIfam models that were built to identify proteins involved in antimicrobial resistance. CDD curators have assigned names and functional labels to approximately 25,000 subfamily domain architectures (SDAs), with a focus on those common in bacterial genomes. SPARCLE supports the automated, evidence-based assignment of names to proteins in RefSeq and the Prokaryotic Genome Annotation Pipeline (PGAP). About 42 million bacterial RefSeq proteins are named via SPARCLE.
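The link between database size and reported E-values mentioned above can be illustrated with the standard simplified BLAST relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database size in residues. The sketch below assumes that simplified form (no edge-effect or composition corrections), and every number in it is hypothetical; none is an actual CDD or RPS-BLAST figure.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E-value from the simplified relation E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical values, chosen only to illustrate the scaling.
query_len = 300            # query length in residues (assumed)
old_db_len = 10_000_000    # assumed database size before the increase
new_db_len = 11_000_000    # assumed database size after the increase
bit_score = 50.0           # same alignment, same bit score

e_old = expected_hits(bit_score, query_len, old_db_len)
e_new = expected_hits(bit_score, query_len, new_db_len)
ratio = e_new / e_old

print(f"E-value with old size: {e_old:.3e}")
print(f"E-value with new size: {e_new:.3e}")
print(f"ratio new/old:         {ratio:.2f}")
```

For a fixed bit score the E-value grows in direct proportion to the database size, so under this relation a modest increase in the size setting gives only a modest increase in E-values.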
CDD shares domain models with InterPro to supplement sequence annotation with data uniquely provided by CDD, including protein domain models for specific subfamilies and functional site annotations. Over 3,100 domain signatures provided by CDD have been integrated by InterPro. CDD is investigating whether model-specific word-score thresholds can be applied when building RPS-BLAST search databases to speed searching while minimizing annotation loss. Instructions for using such a search set will be announced via the CDD news page. CDD also provides tools and data collections, including RPS-BLAST, CD-Search, and sparclbl. CDD is supported by the National Library of Medicine's Intramural Research Program.