[slides] ART: a next-generation sequencing read simulator

ART is a next-generation sequencing read simulator that generates synthetic reads for testing and benchmarking tools for next-generation sequencing data analysis. It emulates the sequencing process with built-in, technology-specific read error models and base quality value profiles derived from large sets of actual sequencing data, and it lets users customize those error models and quality profiles. ART supports all three major commercial next-generation sequencing platforms (Roche's 454, Illumina's Solexa and Applied Biosystems' SOLiD), simulates both single-end and paired-end reads for each, and can report simulated reads in the standard SAM alignment format and as UCSC BED files. The built-in read length profiles are likewise derived from actual sequencing data. ART models all three common types of sequencing error: base substitutions, insertions and deletions. For Illumina reads, substitution errors are simulated based on the empirical, position-dependent distribution of base quality scores. For 454 reads, the dominant error mode is base over-call or under-call, which results in indel-type errors; ART models these with homopolymer length-dependent over-call and under-call error distributions. For SOLiD reads, ART generates nucleotide transition codes, that is, 'color-space' reads. For paired-end simulations, a Gaussian distribution models the DNA fragment sizes. ART's performance was tested using human chromosome 17 as a reference, generating reads at 10× coverage for each of the three sequencing platforms. The test was performed on a desktop computer with an Intel Xeon 2.93 GHz CPU, running a Linux operating system.
This procedure took less than 12 minutes, with Illumina reads being the fastest and SOLiD reads the slowest.
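
To make the error models described above more concrete, here is a minimal Python sketch of three of the ideas: quality-driven substitution errors, Gaussian fragment sizes for paired-end reads, and color-space encoding. It is an illustration only, not ART's actual implementation. All function names are hypothetical, the quality-to-error conversion assumes the standard Phred relation P = 10^(-Q/10), the rule of redrawing fragment lengths shorter than the read is an assumption, and the color codes assume the usual SOLiD two-base encoding (A=0, C=1, G=2, T=3, combined by XOR).

```python
import random

BASES = "ACGT"
# Assumed 2-bit codes for the usual SOLiD two-base (di-base) encoding.
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability (P = 10^(-Q/10))."""
    return 10 ** (-q / 10)


def add_substitutions(read, quals, rng):
    """Substitute each base with the probability implied by its quality score.

    `quals` holds one Phred score per read position, so the error rate can
    vary along the read (position-dependent quality).
    """
    out = []
    for base, q in zip(read, quals):
        if rng.random() < phred_to_error_prob(q):
            # Pick one of the three other bases uniformly (an assumption).
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)


def sample_fragment_length(mean, sd, read_len, rng):
    """Draw a fragment length from a Gaussian, redrawing if shorter than a read."""
    while True:
        length = round(rng.gauss(mean, sd))
        if length >= read_len:
            return length


def to_color_space(seq):
    """Encode each adjacent base pair as a color digit (transition code)."""
    return "".join(str(_CODE[a] ^ _CODE[b]) for a, b in zip(seq, seq[1:]))


if __name__ == "__main__":
    rng = random.Random(1)
    read = "ACGTACGTAC"
    quals = [40, 40, 38, 35, 30, 25, 20, 15, 10, 5]  # made-up example profile
    print(add_substitutions(read, quals, rng))
    print(sample_fragment_length(mean=200, sd=10, read_len=len(read), rng=rng))
    print(to_color_space(read))
```

The quality list, mean fragment size and standard deviation in the usage section are made-up values for demonstration; a real simulation would take them from empirical profiles such as those built into ART.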

ART: a next-generation sequencing read simulator

December 23, 2011 | Weichun Huang, Leping Li, Jason R. Myers, Gabor T. Marth