Understanding dbCAN: a web resource for automated carbohydrate-active enzyme annotation

dbCAN is a web resource for automated carbohydrate-active enzyme (CAZyme) annotation. The authors developed it to annotate CAZyme families automatically using signature domain models. These models are derived from the Conserved Domain Database (CDD) and from literature curation, and are used to build a hidden Markov model (HMM) for each CAZyme family. These family-specific HMMs are the key contribution of dbCAN and are what enable automated CAZyme annotation.
The CAZy database (CAZyDB) is the most comprehensive database of CAZyme proteins, but it has limitations, including that it offers no easy way to query, search or download sequence, structure and annotation data; it does not define signature domains for CAZyme families; and it does not provide automated annotation of CAZyme members in a given genome. dbCAN addresses these issues with a web server for automated CAZyme annotation and with access to sequences, domain models, alignments and phylogeny data for CAZyme-related enzyme families and functional modules.
The authors identified and defined a signature domain for each CAZyme family by analyzing CDD search results and published literature. They used RPS-BLAST to search for CDD models that match most of the proteins in each family, then built HMMs from multiple sequence alignments of the identified CDD domain regions. This CDD-based approach succeeded for 248 CAZyme families. For the remaining 60 families, the authors identified initial signature domains by manually curating published literature and then built HMMs from those domains.
The authors evaluated the accuracy of dbCAN's annotation by comparing it with the CAZyDB annotation. dbCAN showed high sensitivity and positive predictive value (PPV) for both bacterial and plant genomes. Compared with BLAST-based and CDD-based search strategies, dbCAN provided more accurate and comprehensive annotation.
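The sensitivity and PPV comparison can be illustrated with the standard definitions, sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP). The sketch below assumes the comparison is made per protein, treating the CAZyDB annotation as the reference set; the summary does not state the exact protocol, and the function name and protein IDs are hypothetical.

```python
def sensitivity_and_ppv(predicted, reference):
    """Compare predicted CAZyme protein IDs against a reference set.

    Assumption: one ID per protein, and the reference set is treated as truth.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)    # predicted and in the reference
    false_pos = len(predicted - reference)   # predicted but not in the reference
    false_neg = len(reference - predicted)   # in the reference but missed

    # Guard against empty sets to avoid division by zero.
    sensitivity = true_pos / (true_pos + false_neg) if reference else 0.0
    ppv = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, ppv


# Hypothetical protein IDs, for illustration only.
reference_ids = {"p1", "p2", "p3", "p4"}   # stand-in for the reference annotation
predicted_ids = {"p1", "p2", "p5"}         # stand-in for the automated annotation

sens, ppv = sensitivity_and_ppv(predicted_ids, reference_ids)
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

Here two of the four reference proteins are recovered (sensitivity 0.5) and two of the three predictions are correct (PPV about 0.67). A per-family variant would apply the same function to (protein, family) pairs instead of bare protein IDs.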
dbCAN provides a web server for automated CAZyme annotation and offers pre-computed sequence alignments, HMMs and phylogenies of the signature domains in each CAZyme family. It also provides CAZyme family-based browsing, genome-based browsing, keyword search, BLAST search and detailed functional annotation for every sequence included in dbCAN.
The authors applied dbCAN to metagenome datasets and identified over one million full-length CAZyme homologous proteins, three times the number of CAZyme homologs in the NCBI-nr database. This indicates that environmental metagenomes contain many new CAZyme-related proteins that await further investigation.
In conclusion, dbCAN provides a free, easy-to-use public service for automated CAZyme annotation. It offers a unique collection of CAZyme family-specific HMMs built from annotated CAZyme proteins.
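As a rough illustration of how a signature domain might be chosen for a family before its HMM is built, the sketch below picks the CDD model that hits the largest share of a family's proteins. It is not the authors' implementation: the input format (pairs of protein ID and model ID parsed from an RPS-BLAST search), the function name, the model IDs, and the reading of "most of the proteins" as coverage above 0.5 are all assumptions.

```python
from collections import defaultdict


def pick_signature_model(family_proteins, hits, min_coverage=0.5):
    """Return (model_id, coverage) for the best-covering CDD model, or None.

    hits: iterable of (protein_id, cdd_model_id) pairs, assumed to come from
    an already-filtered RPS-BLAST search. min_coverage is an assumed cutoff.
    """
    family = set(family_proteins)
    matched = defaultdict(set)  # model_id -> family proteins it hits
    for protein_id, model_id in hits:
        if protein_id in family:  # ignore hits to proteins outside the family
            matched[model_id].add(protein_id)

    if not family or not matched:
        return None

    # Highest protein count wins; ties are broken by model ID for determinism.
    best_model, best_proteins = max(
        matched.items(), key=lambda item: (len(item[1]), item[0])
    )
    coverage = len(best_proteins) / len(family)
    return (best_model, coverage) if coverage > min_coverage else None


# Hypothetical IDs, for illustration only.
family = ["p1", "p2", "p3", "p4"]
hits = [
    ("p1", "cdd_A"), ("p2", "cdd_A"), ("p3", "cdd_A"),
    ("p3", "cdd_B"), ("x9", "cdd_B"),  # x9 is not a member of this family
]
print(pick_signature_model(family, hits))
```

A family for which this returns None would correspond to the case where no CDD model covers most members, which is where the summary says the authors turned to manual literature curation.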

dbCAN: a web resource for automated carbohydrate-active enzyme annotation

2012 | Yanbin Yin, Xizeng Mao, Jincui Yang, Xin Chen, Fenglou Mao and Ying Xu