2016, Vol. 44, Database issue | Nuala A. O'Leary, Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, Alexander Astashyn, Azat Badretdin, Yiming Bao, Olga Blinkova, Vyacheslav Brover, Vyacheslav Chetvernin, Jinna Choi, Eric Cox, Olga Ermolaeva, Catherine M. Farrell, Tamara Goldfarb, Tripti Gupta, Daniel Haft, Eneida Hatcher, Wratko Hlavina, Vinita S. Joardar, Vamsi K. Kodali, Wenjun Li, Donna Maglott, Patrick Masterson, Kelly M. McGarvey, Michael R. Murphy, Kathleen O'Neill, Shashikant Pujar, Sanjida H. Rangwala, Daniel Rausch, Lillian D. Riddick, Conrad Schoch, Andrei Shkeda, Susan S. Storz, Hanzhen Sun, Francoise Thibaud-Nissen, Igor Tolstoy, Raymond E. Tully, Anjana R. Vatsan, Craig Wallin, David Webb, Wendy Wu, Melissa J. Landrum, Avi Kimchi, Tatiana Tatusova, Michael DiCuccio, Paul Kitts, Terence D. Murphy and Kim D. Pruitt
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records. The project leverages data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) and combines computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The database currently represents sequences from over 55,000 organisms, including viruses, prokaryotes, and eukaryotes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access, and details efforts to further expand the taxonomic representation of the collection. The authors also highlight diverse functional curation initiatives that support multiple uses of RefSeq data, including taxonomic validation, genome annotation, comparative genomics, and clinical testing. The paper discusses how available RNA-Seq and other data types are used in the manual curation process for vertebrate, plant, and other species, and describes a new direction for prokaryotic genomes and protein name management.