July 2006 | T. Z. DeSantis, P. Hugenholtz, N. Larsen, M. Rojas, E. L. Brodie, T. Huber, D. Dalevi, P. Hu, and G. L. Andersen
Greengenes is a 16S rRNA gene database and workbench that addresses limitations of public repositories by providing chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies. Taxonomic nomenclature was found to be incongruent among curators, even at the phylum level. Putative chimeras were identified in 3% of environmental sequences and in 0.2% of records derived from isolates. Environmental sequences were classified into 100 phylum-level lineages in the Archaea and Bacteria.
Greengenes addresses these concerns by providing four features: a standardized set of descriptive fields, taxonomic assignment, chimera screening, and ARB compatibility. Heuristics are used to consider the author's annotations and categorize each source as a named or unnamed isolate, an unnamed symbiont, or an uncultured organism. Other standard descriptors include sequence quality measurements, authors, and a "study_id" that links all the records associated with a project. Greengenes maintains a consistent multiple-sequence alignment (MSA) of both archaeal and bacterial 16S small-subunit rRNA genes to facilitate taxonomic placement.
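The source-categorization heuristic described above can be illustrated with a minimal sketch. The actual Greengenes rules are not given in the text, so the keyword patterns, the category labels, and the function name categorize_source below are all assumptions chosen for illustration only.

```python
import re

# Hypothetical keyword rules; the real Greengenes heuristics are not
# described in the text and are likely more elaborate.
SYMBIONT = re.compile(r"\b(symbiont|endosymbiont)\b", re.IGNORECASE)
UNCULTURED = re.compile(r"\b(uncultured|unidentified|environmental)\b", re.IGNORECASE)
# A Latin binomial such as "Escherichia coli": capitalized genus followed by
# a lowercase species epithet that is not the placeholder "sp".
BINOMIAL = re.compile(r"^[A-Z][a-z]+ (?!sp\b)[a-z]{2,}\b")


def categorize_source(annotation: str) -> str:
    """Assign an author's organism annotation to one of four source categories."""
    text = annotation.strip()
    if SYMBIONT.search(text):
        return "unnamed_symbiont"
    if UNCULTURED.search(text):
        return "uncultured_organism"
    if BINOMIAL.match(text):
        return "named_isolate"
    # Anything else (e.g. "Bacillus sp. A12") is treated as an unnamed isolate.
    return "unnamed_isolate"


if __name__ == "__main__":
    for name in ("Escherichia coli", "Bacillus sp. A12",
                 "uncultured bacterium", "endosymbiont of Riftia pachyptila"):
        print(name, "->", categorize_source(name))
```

The rules are checked in a fixed order so that a record mentioning a symbiont or an uncultured source is never mistaken for a named isolate; a production heuristic would need to handle many more annotation styles.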
Taxonomy proposed by independent curators, including the NCBI, the Ribosomal Database Project (RDP), Wolfgang Ludwig, Phil Hugenholtz, and Norman Pace, is tracked to promote user awareness of several estimations of phylogenetic descent, allowing a balanced approach to node nomenclature when dendrograms are generated. Comprehensive chimera assessment is a distinguishing characteristic of the Greengenes data assembly process. Each sequence is scored for chimeric potential, a breakpoint is estimated, and parent sequences are identified.
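The three outputs of the chimera assessment named above (a score, an estimated breakpoint, and the parent sequences) can be illustrated with a toy sketch. This is not the procedure used by Greengenes or Bellerophon; it is a simplified stand-in that assumes the query and two candidate parents are already rows of the same alignment, and it scores a mosaic of the two parents against the best single parent. The names matches, best_breakpoint, and min_flank are hypothetical.

```python
def matches(a: str, b: str) -> int:
    """Count alignment columns where both sequences carry the same non-gap base."""
    return sum(x == y and x != "-" for x, y in zip(a, b))


def best_breakpoint(query: str, parent_a: str, parent_b: str, min_flank: int = 3):
    """Return (score, breakpoint, (left_parent, right_parent)).

    score is the number of extra matching columns gained by explaining the
    query as a two-parent mosaic rather than by its best single parent.
    A score of 0 with breakpoint None means no mosaic beats a single parent.
    """
    n = len(query)
    if len(parent_a) != n or len(parent_b) != n:
        raise ValueError("all sequences must come from the same alignment")
    single = max(matches(query, parent_a), matches(query, parent_b))
    best = (0, None, None)
    # Try every breakpoint that leaves at least min_flank columns on each side,
    # with either parent contributing the left-hand fragment.
    for bp in range(min_flank, n - min_flank + 1):
        for left, right, order in ((parent_a, parent_b, ("A", "B")),
                                   (parent_b, parent_a, ("B", "A"))):
            mosaic = matches(query[:bp], left[:bp]) + matches(query[bp:], right[bp:])
            score = mosaic - single
            if score > best[0]:
                best = (score, bp, order)
    return best


if __name__ == "__main__":
    print(best_breakpoint("AAAAACCCCC", "AAAAAAAAAA", "CCCCCCCCCC"))
```

A real screen must also choose the candidate parents from a large database, which is where the run-time cost lies; this sketch takes the two parents as given.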
Greengenes simplifies the chore of keeping a research group's private ARB database current by providing standardized alignments and an import filter (greengenes.ift) that imports the alignment and other standardized fields from 16S small-subunit rRNA gene records vetted weekly from GenBank.
To illustrate the utility of the Greengenes data assembly process and to examine the validity of prokaryotic candidate phyla, more than 90,000 public 16S small-subunit rRNA gene sequences were aligned and chimera checked. Taxonomic classifications from the major curators were used when such classifications were available. Sequence data were imported from NCBI for complete or nearly complete gene sequences deposited as of 2 April 2006. Alignment of both archaeal and bacterial sequences was performed with the NAST aligner against a "Core Set" of templates selected from a phylogenetically broad collection.
For high-throughput chimera screening of the aligned sequences, the program Bellerophon was used with two modifications. First, the algorithm was modified to reduce the number of potential parents considered in the partial trees, which allowed run time to scale linearly rather than logarithmically with the