The Sequence Read Archive

The Sequence Read Archive (SRA) is an international public archive for next-generation sequencing data. It is managed by the International Nucleotide Sequence Database Collaboration (INSDC), whose members are the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ), and it is accessible through all three. The SRA provides free, unrestricted access to sequencing data, and many journals and funding agencies use it as the repository for deposited sequencing data. As sequencing has become cheaper and faster, data submissions have grown explosively: the SRA contains over 500 billion reads, 80% of them from the Illumina GA platform. Supported sequencing platforms include Roche/454, Illumina, and SOLiD.

The SRA recommends specific data submission levels and formats, including Sequence Read Format (SRF) for Illumina and SOLiD data and Standard Flowgram Format (SFF) for 454 data. Metadata are stored in XML format as six objects: study, sample, experiment, run, analysis, and submission. An archival BAM format for read alignments is being defined.

Efficient storage and compression are key objectives of the SRA. The NCBI SRA Toolkit is used for data exchange, and better compression methods, including reference-based compression, are being explored. To cope with data growth, the SRA is also evaluating the value of different data types and implementing more efficient compression strategies. Funding for the SRA comes from various sources, including the European Molecular Biology Laboratory, the European Commission, and the Wellcome Trust.
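
The idea behind reference-based compression can be illustrated with a toy sketch: rather than storing every base of an aligned read, store only where the read sits on a reference sequence and where it differs. The code below is an illustration under simplifying assumptions, not the SRA's actual scheme. The names `CompressedRead`, `compress`, and `decompress` are hypothetical, reads are assumed to be already aligned without gaps, and quality scores, insertions, deletions, and unaligned reads are ignored.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CompressedRead:
    """A read stored as differences from a reference (hypothetical layout)."""
    position: int                            # 0-based start on the reference
    length: int                              # number of bases in the read
    mismatches: Tuple[Tuple[int, str], ...]  # (offset within read, observed base)


def compress(read: str, reference: str, position: int) -> CompressedRead:
    """Encode an ungapped read that is already aligned at `position`."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    # Keep only the bases that disagree with the reference.
    diffs = tuple(
        (offset, base)
        for offset, (base, ref_base) in enumerate(zip(read, window))
        if base != ref_base
    )
    return CompressedRead(position, len(read), diffs)


def decompress(record: CompressedRead, reference: str) -> str:
    """Rebuild the original read from the reference plus stored differences."""
    start = record.position
    bases = list(reference[start:start + record.length])
    for offset, base in record.mismatches:
        bases[offset] = base
    return "".join(bases)


# Made-up example data: one read with a single mismatch against the reference.
reference = "ACGTACGTTAGCCGATAGGC"
read = "CGTTAGCAGAT"

record = compress(read, reference, position=5)
restored = decompress(record, reference)
print(record)
print(restored == read)
```

In this sketch a read that matches the reference closely shrinks to a position, a length, and a short list of differences, which is why the approach is attractive when most reads agree with a well-characterized reference.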

The Sequence Read Archive

2011 | Rasko Leinonen, Hideaki Sugawara and Martin Shumway on behalf of the International Nucleotide Sequence Database Collaboration