NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy

The NCBI Reference Sequence (RefSeq) database is a collection of genomic, transcript, and protein sequence records selected and curated from public sequence archives. It covers over 16,000 organisms and includes 2.4 million genomic records, 13 million protein records, and 2 million RNA records. The database is maintained through a combination of automated analyses, collaboration, and manual curation so that sequence information stays up to date. RefSeq provides a stable and consistent coordinate system for clinical variation reporting, comparative genomics, and evolutionary studies, and it is integrated with other NCBI resources such as dbSNP, Gene, and Genomes. RefSeq records are identified by a distinct accession format and are available via Internet query, FTP, BLAST, or scripted queries.

The database continues to grow as new genome and transcript sequences become publicly available. Release 49 (September 2011) includes records from 16,248 species, with a 49.7% increase in the number of organisms and a 14.5% increase in the number of accessions. Microbial organisms account for the greatest number of organisms and accessions, and the number of microbial organisms grows significantly each year.

The human RefSeq records are actively curated to improve the quality of the reference sequences and to provide functionally relevant information. RefSeq provides region-specific genomic records for non-transcribed pseudogenes and for the RefSeqGene project. Pseudogene loci are defined through collaboration with the HUGO Gene Nomenclature Committee or by RefSeq curation staff. RefSeqGene records are used for reporting sequence variation in medical records and locus-specific databases. Transcripts and proteins are a major focus for curation. The database includes two major categories of transcript and protein records, the 'model' and 'known' subsets. The 'model' subset is generated by NCBI's genome annotation pipeline, while the 'known' subset is maintained independently of that pipeline. RefSeq continues to represent protein-coding regions that are considered to be full length and transcripts that are at least nearly complete. Transcripts that are obviously partial are not represented in RefSeq but are displayed in NCBI's genome browser.

RefSeq has implemented new policies for protein names, readthrough transcripts, and non-coding RNAs. It has also expanded feature annotation to indicate localization or function. The RefSeq genome annotation policy considers factors such as quality, completeness, phylogenetic distance, model organism status, and impact on disease and health studies. RefSeq genome representation for prokaryotes is managed by propagating annotation from GenBank, while eukaryotic genomes are managed according to general taxonomic groups and model organism databases. The RefSeq group aims to be more transparent about curation decisions and supporting evidence. The group is working on reporting more explicit information about the underlying support for exon combinations in RefSeq transcripts, and it also plans to provide a comparison utility for evaluating putative functional consequences among transcript variants.
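The distinct accession format and the 'known' versus 'model' subsets described above can be illustrated with a small parser. This is a minimal sketch, not part of the summarized paper: the function name `parse_refseq_accession`, the `RefSeqAccession` type, and the prefix table are illustrative assumptions. The table follows the publicly documented RefSeq prefix convention (for example `NM_` for curated mRNA and `XM_` for pipeline-predicted mRNA), and it is deliberately not exhaustive.

```python
import re
from typing import NamedTuple, Optional


class RefSeqAccession(NamedTuple):
    """Parsed pieces of a RefSeq-style accession (hypothetical helper type)."""
    prefix: str
    number: str
    version: Optional[int]
    molecule: str
    subset: Optional[str]  # 'known', 'model', or None for genomic records


# Partial prefix table (assumption: only the most common prefixes are listed).
# 'known' = curated records; 'model' = produced by the genome annotation pipeline.
_PREFIXES = {
    "NC": ("genomic", None),   # complete genomic molecule, e.g. a chromosome
    "NG": ("genomic", None),   # genomic region, e.g. a RefSeqGene record
    "NT": ("genomic", None),   # assembled contig or scaffold
    "NW": ("genomic", None),   # assembled contig or scaffold (mostly WGS)
    "NM": ("mRNA", "known"),
    "NR": ("non-coding RNA", "known"),
    "NP": ("protein", "known"),
    "XM": ("mRNA", "model"),
    "XR": ("non-coding RNA", "model"),
    "XP": ("protein", "model"),
}

# Two uppercase letters, an underscore, digits, and an optional ".version" suffix.
_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def parse_refseq_accession(text: str) -> RefSeqAccession:
    """Split an accession into prefix, number, and version, then classify it."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {text!r}")
    prefix, number, version = match.groups()
    if prefix not in _PREFIXES:
        raise ValueError(f"prefix {prefix!r} is not in this sketch's table")
    molecule, subset = _PREFIXES[prefix]
    return RefSeqAccession(
        prefix=prefix,
        number=number,
        version=int(version) if version is not None else None,
        molecule=molecule,
        subset=subset,
    )
```

For example, `parse_refseq_accession("NM_000546.5")` yields a record with `molecule == "mRNA"` and `subset == "known"`, while an `XP_` accession is classified as a 'model' protein. Keeping the table as plain data makes it easy to extend with further prefixes as needed.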

NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy

2012 | Kim D. Pruitt*, Tatiana Tatusova, Garth R. Brown and Donna R. Maglott