Rfam: updates to the RNA families database

2009 | Paul P. Gardner, Jennifer Daub, John G. Tate, Eric P. Nawrocki, Diana L. Kolbe, Stinus Lindgreen, Adam C. Wilkinson, Robert D. Finn, Sam Griffiths-Jones, Sean R. Eddy and Alex Bateman
Rfam is a database of RNA sequence families, represented by multiple sequence alignments and covariance models (CMs). The primary aim of Rfam is to annotate new members of known RNA families on nucleotide sequences, particularly complete genomes, using sensitive BLAST filters in combination with CMs. Recent improvements to the website, methodologies and data used by Rfam are discussed. Rfam is freely available on the Web at http://rfam.sanger.ac.uk/ and http://rfam.janelia.org/. Rfam 9.0 contains 603 families, each represented by a multiple sequence alignment of known and predicted representative members of the family, annotated with a consensus base-paired secondary structure. These alignments are used to build CMs with the Infernal software. Each Rfam covariance model is searched against a nucleotide sequence database, producing a list of putative hits. Matches that score above a curated threshold are then aligned to the CM to produce a so-called FULL alignment. RFAMSEQ, the underlying nucleotide sequence database, has been expanded to include whole genome shotgun (WGS) and environmental sequence (ENV) divisions, increasing the number of sequences by more than an order of magnitude. Sequence filters have been improved to enhance the sensitivity and specificity of the Rfam annotation pipeline. These filters include the use of WU-BLAST for homology searches and a sequence mask to reduce the search space. More than 370 families have been expanded through an 'iteration' process, in which some sequences in the FULL alignment are chosen for promotion to the SEED alignment. The sequences selected must pass stringent quality control requirements and be manually approved by a curator. Phylogenetic trees have been estimated for both the SEED and FULL alignments. For the majority of the alignments, trees were produced using an accurate maximum-likelihood approach, which included models of indels. However, for larger alignments, a neighbour-joining method was used instead. 
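The thresholding step described above, in which matches scoring above a curated threshold go forward into the FULL alignment, can be sketched as follows. This is a minimal illustration and not Rfam's pipeline code: the `Hit` record, the `select_full_members` function, and the example scores and threshold are all hypothetical, and the actual search and alignment are performed by Infernal, which is not invoked here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A putative match of a family's CM to a database sequence (hypothetical record)."""
    seq_id: str
    start: int
    end: int
    bit_score: float


def select_full_members(hits, threshold):
    """Keep hits scoring strictly above the family's curated threshold, best first."""
    kept = [h for h in hits if h.bit_score > threshold]
    return sorted(kept, key=lambda h: h.bit_score, reverse=True)


# Illustrative values only; real thresholds are chosen per family by curators.
hits = [
    Hit("seqA", 10, 95, 62.4),
    Hit("seqB", 200, 281, 18.7),
    Hit("seqC", 5, 88, 41.0),
]
full_members = select_full_members(hits, threshold=40.0)
print([h.seq_id for h in full_members])  # → ['seqA', 'seqC']
```

In this sketch the threshold is a single number per family, so raising it trades sensitivity for specificity in the resulting member list.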
The Rfam website has been redesigned to improve the presentation of Rfam data and provide more and better tools for searching the data. The new site provides detailed overviews of Rfam families, including taxonomic trees and phylogenetic trees for the SEED and FULL alignments. New graphical representations of secondary structures have been added to the Rfam website, based on software from the Vienna RNA package. These representations include sequence conservation, covariation, base-pair conservation and CM scores. The Rfam website now draws textual annotation of RNA families directly from Wikipedia. Any updates to relevant Wikipedia articles are downloaded on a nightly basis using the MediaWiki API, verified by members of the consortium and presented on the Rfam site. The rate of discovery of new RNA families is accelerating rapidly, facilitated by advancements in new sequencing technologies and targeted computational screens. Rfam continues to evaluate new technologies and techniques as they emerge and will adopt new procedures for building and checking
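
The nightly Wikipedia synchronisation mentioned above could look roughly like the sketch below. It only builds a MediaWiki API query for the latest revision of an article and compares revision IDs to decide whether a curator needs to look at it. The English Wikipedia endpoint, the function names, the article title and the revision-ID comparison are assumptions made for illustration; the source states only that updates are fetched with the MediaWiki API and verified before display.

```python
from urllib.parse import urlencode

# Assumed endpoint; the source does not say which Wikipedia host is queried.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_revision_query(title):
    """Build a MediaWiki API URL asking for the latest revision of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp",
        "rvlimit": 1,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def needs_review(latest_revid, approved_revid):
    """Hypothetical check: flag an article whose latest revision is not the approved one."""
    return latest_revid != approved_revid


# Illustrative article title; no network request is made in this sketch.
url = build_revision_query("Transfer RNA")
print(url)
```

Fetching the URL and parsing the JSON response is left out so that the example runs without network access.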