Understanding RefSeq: an update on mammalian reference sequences

The National Center for Biotechnology Information (NCBI) Reference Sequence (RefSeq) database is a collection of annotated genomic, transcript, and protein sequence records derived from public sequence archives and computational, curation, and collaboration efforts. This article reports on the growth of the mammalian and human subsets, changes to the NCBI eukaryotic annotation pipeline, and modifications affecting transcript and protein records. Recent improvements to the eukaryotic genome annotation pipeline have increased throughput and included RNAseq data, leading to a significant expansion of annotated transcripts and novel exons in mammalian RefSeq genomes. New annotation changes include reporting supporting evidence for transcript records, modifying exon feature annotation, and adding structured reports of gene and sequence attributes. A revised protein annotation policy for alternatively spliced transcripts with more divergent predicted proteins is also described, along with the current status of the RefSeqGene project. The RefSeq database provides sequence records for genomes, transcripts, and proteins of viruses, microbes, organelles, and eukaryotic organisms. Mammalian genomic records include annotated nuclear and mitochondrial genomes, non-transcribed pseudogenes, haplotype-specific regions, and RefSeqGene records. Transcript records may be protein-coding, non-coding (ncRNA), or structural RNAs. RefSeq records are generated through automatic and manual processing of public sequence data, NCBI's eukaryotic genome annotation pipeline, and expert databases. Curation by NCBI staff and collaboration with external groups help maintain the quality of the collection and related resources. RefSeq data are available via FTP and the NCBI website, with new and updated records provided daily and full releases weekly. RefSeq release 61 included over 41 million sequence records from over 29,000 organisms. 
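The record types listed above (genomic, transcript, and protein) are reflected in the prefix of each RefSeq accession. As a rough sketch, the function below sorts an accession into a molecule type and into curated ("known") or pipeline-predicted ("model") status. The prefix table is deliberately partial and follows NCBI's published accession conventions; the names `classify_accession` and `REFSEQ_PREFIXES` are hypothetical helpers introduced here for illustration, not part of any NCBI tool.

```python
import re

# Partial prefix table (assumption: limited to the common mammalian prefixes).
# N* transcript/protein prefixes are curated "known" records; X* prefixes are
# "model" records predicted by the annotation pipeline.
REFSEQ_PREFIXES = {
    "NC_": ("genomic", None),          # complete genomic molecule (chromosome, organelle)
    "NG_": ("genomic", None),          # genomic region, including RefSeqGene records
    "NM_": ("mRNA", "known"),
    "NR_": ("non-coding RNA", "known"),
    "NP_": ("protein", "known"),
    "XM_": ("mRNA", "model"),
    "XR_": ("non-coding RNA", "model"),
    "XP_": ("protein", "model"),
}

# Two-letter prefix, underscore, digits, optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2}_)(\d+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return a dict describing the accession, or None if it is not recognized."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix, _number, version = match.groups()
    if prefix not in REFSEQ_PREFIXES:
        return None
    molecule, status = REFSEQ_PREFIXES[prefix]
    return {
        "prefix": prefix,
        "molecule": molecule,
        "status": status,
        "version": int(version) if version is not None else None,
    }


if __name__ == "__main__":
    # Example strings are used only for their prefixes.
    for acc in ["NM_000546.6", "XM_000001.1", "NG_000002", "not-an-accession"]:
        print(acc, classify_accession(acc))
```

Prefixes outside this table (for example, those used for contigs and scaffolds) return `None` here; extend the table if such records matter for the task at hand.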
The mammalian and human subsets of RefSeq have grown steadily, with the number of RefSeq transcript records increasing over time. The eukaryotic genome annotation pipeline has been enhanced with automation and parallel computing, which has substantially increased the number of genomes annotated. The pipeline now also incorporates RNAseq data, which improves the identification and representation of protein-coding and ncRNA alternative splice variants and exons and has led to a larger increase in model RefSeqs. Recent policy changes allow a gene to be represented by a mixture of known and predicted model RefSeqs, and the annotation pipeline was modified accordingly to include computed transcript variant and protein isoform names. The RefSeqGene project provides stable coordinate systems for clinical testing laboratories. RefSeq records also carry structured comments that report supporting evidence and biological attributes (evidence data and RefSeq attributes); these comments help in evaluating alignments and gene predictions.
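To make the idea of structured comments concrete, the sketch below parses key/value blocks out of a record's comment text. It assumes the layout seen in flat-file records, where each block is bracketed by `##<Name>-START##` and `##<Name>-END##` lines and each field is written as `key :: value`; that layout should be checked against real records before relying on it. The sample text, including its accession-like strings and attribute wording, is invented for illustration, and `parse_structured_comments` is a hypothetical helper.

```python
import re

# Assumed block layout: ##<Name>-START## ... ##<Name>-END##
BLOCK_RE = re.compile(
    r"##(?P<name>[\w-]+)-START##(?P<body>.*?)##(?P=name)-END##",
    re.DOTALL,
)


def parse_structured_comments(comment_text):
    """Map each block name to a dict of its 'key :: value' fields."""
    blocks = {}
    for match in BLOCK_RE.finditer(comment_text):
        fields = {}
        for line in match.group("body").splitlines():
            if "::" not in line:
                continue  # skip blank or free-text lines
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
        blocks[match.group("name")] = fields
    return blocks


# Invented sample; identifiers below are placeholders, not real records.
SAMPLE_COMMENT = """
##Evidence-Data-START##
Transcript exon combination :: ACC000001.1, ACC000002.1
RNAseq introns              :: single sample supports all introns
##Evidence-Data-END##

##RefSeq-Attributes-START##
example attribute :: example value
##RefSeq-Attributes-END##
"""

if __name__ == "__main__":
    for name, fields in parse_structured_comments(SAMPLE_COMMENT).items():
        print(name)
        for key, value in fields.items():
            print(f"  {key}: {value}")
```

Keeping each block as a plain dictionary makes it straightforward to filter transcripts by the kind of evidence reported, for instance by checking whether an `RNAseq introns` field is present.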