Published online 1 December 2011 | Scott Federhen*
The NCBI Taxonomy database is the standard nomenclature and classification repository for the International Nucleotide Sequence Database Collaboration (INSDC), including GenBank, ENA (EMBL), and DDBJ. It provides organism names and taxonomic lineages for sequences in the INSDC's nucleotide and protein databases. The database is manually curated by a small team of scientists at the NCBI, using current taxonomic literature to maintain a phylogenetic taxonomy. It serves as a central hub for many NCBI resources, facilitating clustering within other domains of the Entrez system and linking to external taxon-specific resources.
The project began in 1991 with the design of the Entrez information retrieval system, which aimed to unify nucleotide and protein sequences with relevant abstracts from the scientific literature. The initial challenge was to merge the taxonomies used by the different source databases, which shared a common origin but had since diverged. Over time, the INSDC members agreed to resolve taxonomic issues before releasing new sequence data, which improved the process.
The NCBI Taxonomy database includes formal and informal names for species, with formal names regulated by specific codes of nomenclature. It supports various name types, such as scientific names, synonyms, misspellings, and common names. The data are held in an SQL Server relational database called TAXON, and public access is provided through the Taxonomy Browser, the Taxonomy domain of Entrez, and an FTP site.
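To make the idea of name types concrete, the following is a minimal sketch of grouping names by type for each taxon. It assumes the `names.dmp` layout used in the public taxonomy dump (fields separated by tab, pipe, tab; the columns are taxon ID, name, unique name, and name class); the summary above does not describe that file, so treat the layout as an assumption. The sample rows and the helper `parse_names` are illustrative, not copied from the real dump.

```python
from collections import defaultdict

# Illustrative rows in the assumed names.dmp layout:
# tax_id | name_txt | unique_name | name_class |
SAMPLE_NAMES_DMP = (
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tcommon name\t|\n"
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n"
    "562\t|\tEschericia coli\t|\t\t|\tmisspelling\t|\n"
)


def parse_names(lines):
    """Group names by taxon ID and name class (hypothetical helper)."""
    names = defaultdict(lambda: defaultdict(list))
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Drop the trailing "\t|" that terminates each row.
        if line.endswith("\t|"):
            line = line[:-2]
        fields = line.split("\t|\t")
        tax_id, name_txt, _unique_name, name_class = fields[:4]
        names[int(tax_id)][name_class].append(name_txt)
    return names


names = parse_names(SAMPLE_NAMES_DMP.splitlines())
for tax_id, by_class in sorted(names.items()):
    for name_class, values in by_class.items():
        print(tax_id, name_class, values)
```

Keying on the taxon ID first keeps every name type for one organism together, so a misspelling or common name can be resolved back to the single scientific name for that taxon.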
Future initiatives, such as the Barcodes of Life project, aim to expand the database by sequencing reference specimens from every eukaryotic species. The database is funded by the Intramural Research Program of the National Institutes of Health, National Library of Medicine.