2020, Vol. 48, Database issue | Shennan Lu, Jiyao Wang, Farideh Chitsaz, Myra K. Derbyshire, Renata C. Geer, Noreen R. Gonzales, Marc Gwadz, David I. Hurwitz, Gabriele H. Marchler, James S. Song, Narmada Thanki, Roxanne A. Yamashita, Mingzhang Yang, Dachuan Zhang, Chanjuan Zheng, Christopher J. Lanczycki and Aron Marchler-Bauer
This article gives an overview of the Conserved Domain Database (CDD) as it enters its 20th year of operation. CDD, maintained by the National Center for Biotechnology Information (NCBI), is a resource for annotating protein domains and recording conserved sites. Its hierarchical classifications and live search services support hypothesis-driven biomolecular research.
CDD v3.17, the current production version, contains 52,910 protein and protein domain models from several sources; v3.18, expected for release in winter 2019/2020, will contain 55,434 models. The NCBIfam collection, derived from Hidden Markov Models (HMMs), is used to improve bacterial genome annotation, though it excludes models related to antimicrobial resistance. CDD covers about 85% of the sequences in the Entrez/protein database and 94% of the protein sequences from 3D structures. SPARCLE, the Subfamily Protein Architecture Labeling Engine, assigns names and functional labels to subfamily domain architectures, particularly those common in bacterial genomes, and supports automated protein naming in RefSeq and the Prokaryotic Genome Annotation Pipeline (PGAP). CDD also shares domain models with InterPro to enhance sequence annotation. Future work includes exploring model-specific word-score thresholds for RPS-BLAST search databases to improve search efficiency. The article closes by acknowledging contributing teams and resources and listing the project's funding sources.
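To put the two release figures in proportion, the short Python sketch below computes the net change in model count between v3.17 and v3.18. The function name `model_growth` is hypothetical and used only for illustration. The two counts are the ones quoted in the summary. The difference is a net figure, so it says nothing about how many individual models were added, retired, or replaced.

```python
def model_growth(current: int, upcoming: int) -> tuple[int, float]:
    """Return the net and percentage change in model count between two releases."""
    added = upcoming - current
    return added, 100.0 * added / current


# Model counts quoted in the summary for CDD v3.17 and v3.18.
added, pct = model_growth(52_910, 55_434)
print(f"v3.17 -> v3.18: net {added:+d} models ({pct:.1f}%)")
```

On these figures, the planned release grows the collection by a little under 5%.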