NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy

2012, Vol. 40, Database issue | Kim D. Pruitt*, Tatiana Tatusova, Garth R. Brown and Donna R. Maglott
The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a comprehensive collection of genomic, transcript, and protein sequence records, curated from public sequence archives to reduce redundancy. As of RefSeq release 49 (September 2011), the database includes records from over 16,000 organisms, with 2.4 million genomic records, 13 million proteins, and 2 million RNA records. The database is maintained through a combination of automated analyses, collaboration, and manual curation, which keeps sequence information, feature annotations, and cross-references up to date.
Recent growth in the RefSeq dataset has been significant, with an annual increase of 49.7% in the number of organisms and 14.5% in the number of accessions. Microbial organisms account for the majority of records, with a notable increase in microbial RNA records due to the RefSeq Targeted Locus project.
Human RefSeq records are actively curated by NCBI staff, who focus on improving quality, maintaining functional information, and coordinating with international curation groups. This work includes defining pseudogene loci, maintaining RefSeqGene records, and reviewing transcripts and proteins. Of the human transcripts in the database, 92.5% of protein-coding transcripts and 57.2% of non-coding transcripts now have a curated status.
Recent changes to RefSeq include new policies for managing protein names and readthrough transcripts, and expanded representation of non-coding RNAs. Feature annotation has also been expanded to include localization, function, and details of the sequence considered during manual review.
NCBI's genome annotation policy considers several factors, including sequencing quality, phylogenetic distance, and utility to research projects. The NCBI annotation pipeline is used for both prokaryotic and eukaryotic genomes, with a focus on providing a single standard annotation for reference genomes.
Future directions for RefSeq include increasing transparency in curation decisions and supporting evidence, expanding transcript feature annotation, and reporting more explicit information about exon combinations and transcript variants.