SMART: a web-based tool for the study of genetically mobile domains

2000 | Jörg Schultz¹², Richard R. Copley¹², Tobias Doerks¹², Chris P. Ponting³ and Peer Bork¹²
SMART is a web-based tool for the study of genetically mobile domains. It allows the identification and annotation of domains in proteins, and the analysis of domain architectures. Over 400 domain families are detectable, including those found in signaling, extracellular and chromatin-associated proteins. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database, along with search parameters and taxonomic information, is stored in a relational database system. User interfaces allow searches for proteins containing specific combinations of domains in defined taxa.

The SMART alignment set relies on multiple sequence alignments of representative family members. In the past year, the alignment construction method has been improved to achieve higher levels of reproducibility, and the number of detectable domain families has increased. Older alignments have been updated to integrate new homology and structural findings. As a result, SMART alignments are of high quality and have been used in recent comparative genomics studies.

The alignment construction protocol starts with an alignment of divergent family members based on known tertiary structures or homologues identified in a PSI-BLAST analysis. These alignments are manually optimized and used to search current sequence databases. Each sequence of the alignment is also used as a query in a PSI-BLAST search. All sequences that are significantly similar are added to the alignment using the sequence-versus-HMM alignment method of HMMer. Alignments are checked manually for potential false positives or misassembled protein sequences. From this alignment, one sequence of each pair sharing >67% identity is deleted to reduce redundancy. The resulting alignment is used as a starting point for a subsequent round of searches.
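The search-and-prune cycle described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SMART pipeline itself: the function names (`percent_identity`, `reduce_redundancy`, `build_alignment`) are invented for this example, the `search` argument is a hypothetical stand-in for the database and PSI-BLAST searches plus the HMMer alignment step, identity is counted over columns where both sequences have a residue, and the choice of which member of a redundant pair to drop (the later one) is an assumption. The manual checking steps are not modelled.

```python
from typing import Callable, Dict

Alignment = Dict[str, str]  # sequence name -> aligned sequence ('-' marks a gap)


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def reduce_redundancy(alignment: Alignment, threshold: float = 67.0) -> Alignment:
    """Greedy filter: keep a sequence only if it shares at most `threshold`
    percent identity with every sequence kept so far (assumed tie-break:
    the earlier sequence of a redundant pair survives)."""
    kept: Alignment = {}
    for name, seq in alignment.items():
        if all(percent_identity(seq, other) <= threshold for other in kept.values()):
            kept[name] = seq
    return kept


def build_alignment(
    seed: Alignment,
    search: Callable[[Alignment], Alignment],  # hypothetical search + align step
    max_rounds: int = 20,
) -> Alignment:
    """Repeat search -> add hits -> prune until a round finds no new sequences."""
    alignment = reduce_redundancy(dict(seed))
    seen = set(seed)  # every name ever encountered, including pruned ones
    for _ in range(max_rounds):
        hits = search(alignment)
        new = {name: seq for name, seq in hits.items() if name not in seen}
        if not new:
            break  # converged: no new homologues detected
        seen.update(new)
        alignment = reduce_redundancy({**alignment, **new})
    return alignment


# Toy usage with an invented search function that always reports the same hits.
def toy_search(current: Alignment) -> Alignment:
    return {"b": "ACDEFH", "c": "AKLMNP"}


final = build_alignment({"a": "ACDEFG"}, toy_search)
```

Tracking `seen` separately from the alignment keeps pruned sequences from being counted as new hits in later rounds, which is what lets the loop terminate.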
This iterative procedure is pursued until no new homologues are detected.

SMART has been expanded to detect domains of extracellular proteins and bacterial two-component regulatory systems. In 1999, domains associated with DNA, RNA, chromatin and actin cytoskeleton functions were added. In addition, newly reported domain families that fall within the categories covered by SMART have been incorporated. These include extracellular GPS and PSI domains, intracellular signalling domains such as ENTH and GoLoco, and domains in splicing factors. As a result, SMART now includes >400 domains.

The SMART database stores information on >400 domain types in >54,000 different proteins using a relational database management system. For each domain hit, boundaries, raw bit score and E-value are recorded. The protein accession code, description line, sequence length and species name are stored. To allow phylogenetic analyses, the full taxonomic description for each species derived from the NCBI Taxonomy database is also recorded. Each SMART domain is identified by a unique accession number, which provides a stable reference for other domain databases, and is linked to corresponding domains in Pfam and PROSITE. By including annotation, search parameters and cross-references to other
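To make the stored fields and the domain-combination search more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The actual SMART schema and database system are not given in the text, so everything here is an assumption for illustration: the table and column names, the function `proteins_with_domains`, the storage of the taxonomic lineage as a single text field matched with `LIKE`, and the toy rows (accessions `TOY1` to `TOY3` and all scores are invented).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE protein (
        accession   TEXT PRIMARY KEY,
        description TEXT,
        seq_length  INTEGER,
        species     TEXT,
        taxonomy    TEXT      -- full lineage as one string (assumed layout)
    );
    CREATE TABLE domain_hit (
        protein_accession TEXT REFERENCES protein(accession),
        domain    TEXT,       -- domain family name
        start_pos INTEGER,    -- domain boundaries
        end_pos   INTEGER,
        bit_score REAL,
        e_value   REAL
    );
    """
)

# Invented toy rows, purely for illustration.
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?, ?)",
    [
        ("TOY1", "toy adaptor protein", 300, "Homo sapiens", "Eukaryota; Metazoa; Chordata"),
        ("TOY2", "toy kinase", 500, "Mus musculus", "Eukaryota; Metazoa; Chordata"),
        ("TOY3", "toy fungal protein", 250, "Saccharomyces cerevisiae", "Eukaryota; Fungi"),
    ],
)
conn.executemany(
    "INSERT INTO domain_hit VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("TOY1", "SH2", 10, 100, 85.2, 1e-20),
        ("TOY1", "SH3", 120, 180, 60.1, 1e-12),
        ("TOY2", "SH2", 30, 120, 90.0, 1e-22),
        ("TOY3", "SH2", 5, 95, 70.3, 1e-15),
        ("TOY3", "SH3", 110, 170, 55.7, 1e-10),
    ],
)


def proteins_with_domains(conn, domains, taxon):
    """Return accessions of proteins in `taxon` that contain every listed domain."""
    wanted = sorted(set(domains))
    if not wanted:
        raise ValueError("at least one domain name is required")
    placeholders = ",".join("?" for _ in wanted)
    sql = f"""
        SELECT p.accession
        FROM protein AS p
        JOIN domain_hit AS h ON h.protein_accession = p.accession
        WHERE h.domain IN ({placeholders})
          AND p.taxonomy LIKE ?
        GROUP BY p.accession
        HAVING COUNT(DISTINCT h.domain) = ?
        ORDER BY p.accession
    """
    params = [*wanted, f"%{taxon}%", len(wanted)]
    return [row[0] for row in conn.execute(sql, params)]


metazoan_sh2_sh3 = proteins_with_domains(conn, ["SH2", "SH3"], "Metazoa")
```

Counting distinct domain names per protein in the `HAVING` clause is one simple way to require all requested domains at once; a production schema would more likely normalise the taxonomy into its own table rather than rely on substring matching.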