The sequence read archive: explosive growth of sequencing data

2012, Vol. 40, Database issue | Yuichi Kodama, Martin Shumway and Rasko Leinonen, on behalf of the International Nucleotide Sequence Database Collaboration
The article discusses the explosive growth of next-generation sequencing data and the role of the Sequence Read Archive (SRA) in archiving and sharing it. The SRA, established as part of the International Nucleotide Sequence Database Collaboration (INSDC), is a public repository for raw sequencing data from various platforms. It aims to facilitate reproducible science by providing access to pre-publication sequence data for large-scale international projects. By 2011, the SRA had surpassed 100 terabases of open-access genome sequence reads, with the Illumina platform accounting for 84% of sequenced bases.
The SRA accepts raw sequence data, including base calls and quality scores, and supports multiple file formats and sequencing platforms. Its metadata model comprises six objects (study, sample, experiment, run, analysis, and submission), each with a unique identifier, and has been updated to better represent new sequencing technologies and applications. To cope with the rapid growth of data, the SRA partners are discussing and pursuing several approaches, such as reference-based compression, quantization of base quality values, and selective storage of metadata, with the goal of maximizing the benefit of archiving while minimizing infrastructure costs.
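The six-object metadata model can be pictured as a small set of records that carry unique identifiers and point at one another. The following is a minimal sketch, not the SRA's actual schema. The class and function names, the `refs` field, and the placeholder identifiers are all hypothetical, and the idea that objects reference each other by identifier is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ObjectType(Enum):
    # The six metadata object types named in the summary.
    STUDY = "study"
    SAMPLE = "sample"
    EXPERIMENT = "experiment"
    RUN = "run"
    ANALYSIS = "analysis"
    SUBMISSION = "submission"


@dataclass(frozen=True)
class MetadataObject:
    object_type: ObjectType
    identifier: str    # unique identifier (placeholder values in the usage below)
    refs: tuple = ()   # identifiers of related objects (assumed linkage)


def index_objects(objects):
    """Build an identifier -> object index, rejecting duplicate identifiers."""
    index = {}
    for obj in objects:
        if obj.identifier in index:
            raise ValueError(f"duplicate identifier: {obj.identifier}")
        index[obj.identifier] = obj
    return index


def unresolved_refs(index):
    """Return (identifier, missing_ref) pairs for references that point nowhere."""
    return [
        (obj.identifier, ref)
        for obj in index.values()
        for ref in obj.refs
        if ref not in index
    ]


if __name__ == "__main__":
    objects = [
        MetadataObject(ObjectType.STUDY, "STUDY-1"),
        MetadataObject(ObjectType.SAMPLE, "SAMPLE-1"),
        MetadataObject(ObjectType.EXPERIMENT, "EXP-1", refs=("STUDY-1", "SAMPLE-1")),
        MetadataObject(ObjectType.RUN, "RUN-1", refs=("EXP-1",)),
    ]
    print(unresolved_refs(index_objects(objects)))
```

Keeping identifiers unique and checking that every reference resolves is the kind of consistency a submission would plausibly need before it is accepted, although the summary does not describe the archive's actual validation rules.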
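Quantization of base quality values, one of the approaches listed above, maps many distinct quality scores onto a few representative values so that quality strings become more repetitive and compress better. The sketch below illustrates the general idea only. It assumes Phred+33 ASCII-encoded quality strings, and the bin boundaries in `BINS` are hypothetical, not a scheme the summary attributes to the SRA.

```python
# Hypothetical bins: (lowest score, highest score, representative score).
BINS = [
    (0, 1, 0),
    (2, 9, 6),
    (10, 19, 15),
    (20, 29, 25),
    (30, 39, 35),
    (40, 93, 40),
]
PHRED_OFFSET = 33  # assumes Phred+33 (Sanger-style) ASCII encoding


def quantize_score(q):
    """Map one Phred score to the representative value of its bin."""
    for low, high, representative in BINS:
        if low <= q <= high:
            return representative
    raise ValueError(f"quality score out of range: {q}")


def quantize_quality_string(qual):
    """Quantize every score in a Phred+33 quality string, preserving its length."""
    return "".join(
        chr(quantize_score(ord(c) - PHRED_OFFSET) + PHRED_OFFSET) for c in qual
    )


if __name__ == "__main__":
    original = "IIHGF5+!"
    quantized = quantize_quality_string(original)
    # Fewer distinct symbols means less information to store per base.
    print(len(set(original)), "distinct symbols before,", len(set(quantized)), "after")
```

The transformation is lossy: the original scores cannot be recovered, which is the trade-off between storage cost and fidelity that such an approach has to weigh.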