The sequence read archive: explosive growth of sequencing data

The Sequence Read Archive (SRA) is a public repository for next-generation sequencing data. It is part of the International Nucleotide Sequence Database Collaboration (INSDC), which includes NCBI, EBI, and DDBJ. The SRA was established in 2009 to store raw data generated by high-throughput sequencing platforms, which supports reproducible science. It serves as core infrastructure for sharing pre-publication sequence data and supports large-scale projects such as the Human Microbiome and 1000 Genomes projects. Data requiring authorized access, such as human genome sequences, are stored separately in dbGaP and EGA.

The SRA has surpassed 100 terabases of open-access sequence data, with Illumina accounting for 84% of sequenced bases. The most common study types include whole genome sequencing, re-sequencing, population genomics, metagenomics, and epigenetics. The SRA accepts raw sequence data, including base calls and quality scores, and alignments in BAM format. It supports a range of sequencing platforms and data formats, with the aim of balancing archival cost against data usability.

The SRA metadata model comprises six objects: study, sample, experiment, run, analysis, and submission. The model has been updated to better represent new sequencing technologies. The SRA data exchange model follows the INSDC policy for exchanging GenBank, EMBL-Bank, and DDBJ entries.

Funding for the SRA comes from various organizations, including the DNA Data Bank of Japan, the European Molecular Biology Laboratory, and the Wellcome Trust. The SRA partners actively discuss approaches to managing the explosive growth of sequencing data, including reference-based compression, quality score quantization, and selective storage.
The SRA continues to collaborate with the research community to explore appropriate data reduction approaches.
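
Of the data reduction approaches mentioned above, quality score quantization is the easiest to illustrate. The Python sketch below shows the general idea only: each Phred quality score is replaced by a representative value for its bin, so fewer distinct symbols remain and the quality string compresses better. The bin edges, the representative values, and the function names are hypothetical, chosen for illustration; they are not the scheme used by the SRA or by any sequencing platform.

```python
from bisect import bisect_right

# Hypothetical bins for illustration only (not an SRA or vendor scheme).
# A score q falls into bin i, where i is the number of edges <= q.
BIN_EDGES = [2, 10, 20, 30, 40]
BIN_VALUES = [0, 6, 15, 25, 35, 40]  # representative score for each bin


def decode_fastq_quality(quality_string, offset=33):
    """Convert a FASTQ quality string (Phred+33 by default) to integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def encode_fastq_quality(scores, offset=33):
    """Convert integer Phred scores back to a FASTQ quality string."""
    return "".join(chr(q + offset) for q in scores)


def quantize_phred(scores):
    """Replace each Phred score with the representative value of its bin."""
    return [BIN_VALUES[bisect_right(BIN_EDGES, q)] for q in scores]


if __name__ == "__main__":
    original = "IIII5555##"
    scores = decode_fastq_quality(original)
    reduced = encode_fastq_quality(quantize_phred(scores))
    print(original, "->", reduced)
```

Quantization is lossy: the original scores cannot be recovered from the binned values, which is why the choice of bins involves the trade-off between archival cost and data usability described above.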

2012 | Yuichi Kodama, Martin Shumway and Rasko Leinonen on behalf of the International Nucleotide Sequence Database Collaboration