NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins

2005 | Kim D. Pruitt*, Tatiana Tatusova and Donna R. Maglott
The NCBI Reference Sequence (RefSeq) database is a curated, non-redundant collection of genomic, transcript, and protein sequences. It aims to provide a comprehensive dataset representing the complete sequence information for any given species and, in practice, draws on the sequence data publicly available in archival databases. The database includes over 2400 organisms and over one million proteins, representing significant taxonomic diversity. Nucleotide and protein sequences are explicitly linked to each other and to other resources such as the NCBI Map Viewer and Gene. Sequences are annotated through a combination of collaboration, automated annotation, propagation from GenBank, and curation by NCBI staff. RefSeq differs from GenBank in that it provides a nearly non-redundant collection representing the 'current' view of sequence information, names, and annotations. RefSeq records are distinguished from GenBank records by their accession numbers, whose format includes an underscore. The database is continuously curated by collaborating groups and NCBI staff; sequence records are presented in a standard format and are subject to computational validation. RefSeq records are annotated from multiple sources, including original GenBank submissions, collaborating groups, NCBI computational analysis, user feedback, and manual curation. Annotation includes coding regions, conserved domains, variation, references, names, database cross-references, and other features. For some species, genome annotation is provided by NCBI computational processes that use transcript alignments, protein support, and a hidden Markov model (HMM) ab initio prediction algorithm. The RefSeq database provides a critical foundation for integrating sequence, genetic, and functional information and is used internationally as a standard for genome annotation.
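The summary does not spell out the accession format, so the following is only a minimal sketch of how a RefSeq-style accession might be recognized. It assumes the commonly documented pattern of a two-letter prefix, an underscore, a run of digits, and an optional version suffix. The prefix table is a partial, illustrative list, and the names `REFSEQ_PREFIXES` and `parse_refseq_accession` are hypothetical helpers, not part of any NCBI tool.

```python
import re
from typing import NamedTuple, Optional

# Partial, illustrative list of RefSeq accession prefixes (assumption:
# taken from general NCBI documentation, not from the summarized paper).
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule",
    "NG": "genomic region",
    "NT": "genomic contig",
    "NW": "genomic contig (WGS)",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "XM": "model mRNA",
    "XR": "model non-coding RNA",
    "NP": "protein",
    "XP": "model protein",
}

# Two uppercase letters, an underscore, digits, and an optional ".version".
_ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


class RefSeqAccession(NamedTuple):
    prefix: str
    number: str
    version: Optional[int]
    molecule: str


def parse_refseq_accession(text: str) -> Optional[RefSeqAccession]:
    """Return the parsed accession, or None if it does not look like RefSeq."""
    match = _ACCESSION_RE.match(text.strip())
    if match is None:
        return None
    prefix, number, version = match.groups()
    molecule = REFSEQ_PREFIXES.get(prefix)
    if molecule is None:
        # Unknown prefix: treat as not recognized rather than guessing.
        return None
    return RefSeqAccession(prefix, number, int(version) if version else None, molecule)
```

With this sketch, a string such as `NM_000546.6` would be recognized as an mRNA record, while a GenBank-style accession without an underscore, such as `U12345`, would be rejected.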
RefSeq sequences are validated to confirm accurate nucleotide-to-protein sequence correspondence, valid ASN.1 format, and current preferred name and symbol designations. The curation status is annotated on RefSeq records, with terms indicating the level of curation. RefSeq data can be accessed through various methods, including Entrez query, BLAST, FTP, and links provided from NCBI databases and resources. The RefSeq collection is made available for anonymous FTP as bi-monthly releases, with documentation indicating files and sequences provided, sequences removed since the previous release, and a full description of the release structure and content. Users can subscribe to the refseq-announce email list to receive information about RefSeq releases and planned modifications.
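The nucleotide-to-protein correspondence check mentioned above can be illustrated with a simplified sketch. This is not NCBI's validation pipeline: it assumes the standard genetic code only, ignores special cases such as alternative translation tables or selenocysteine, and the function names `translate_cds` and `cds_matches_protein` are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering,
# with '*' marking stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence up to (not including) the first stop codon."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        # Codons with ambiguous bases are reported as 'X'.
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def cds_matches_protein(cds: str, protein: str) -> bool:
    """Check that the conceptual translation of the CDS equals the protein."""
    return translate_cds(cds) == protein.upper().rstrip("*")
```

For example, `cds_matches_protein("ATGGCCAAGTAA", "MAK")` evaluates to `True`, whereas changing the protein to `"MAR"` makes the check fail, which is the kind of mismatch a correspondence check is meant to flag.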