Advance Access publication December 23, 2011 | Weichun Huang, Leping Li, Jason R. Myers, Gabor T. Marth
ART is a set of simulation tools that generate synthetic next-generation sequencing reads, which are essential for testing and benchmarking tools for read alignment, de novo assembly, and genetic variation discovery. ART supports three major commercial platforms: Roche’s 454, Illumina’s Solexa, and Applied Biosystems’ SOLiD. It emulates the sequencing process with technology-specific read error models and base quality value profiles derived from large datasets. ART can simulate both single-end and paired-end reads, and it models substitution, insertion, and deletion errors. The software is available in both source and binary form and can report simulated reads in SAM alignment format and UCSC BED files. ART has been widely used for sequencing software development and has been particularly useful in the context of the 1000 Genomes Project.
The article also provides details on the simulation methods for each platform, including the empirical distributions and error rates used.
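To make the idea of a quality-driven error model concrete, the following is a minimal sketch and not ART's actual implementation. It assumes only the standard Phred relation between a quality score Q and an error probability, 10^(-Q/10), and it models substitution errors alone; ART also models insertions and deletions and draws quality values from empirical, platform-specific profiles. The function names `phred_to_error_prob` and `simulate_read` are hypothetical, chosen for illustration.

```python
import random


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def simulate_read(reference, start, quals, rng):
    """Copy len(quals) bases from `reference` starting at `start`.

    Each base is replaced by a different, randomly chosen base with the
    error probability implied by its quality score. Only substitutions
    are modeled here (an assumption of this sketch).
    """
    bases = "ACGT"
    read = []
    for offset, q in enumerate(quals):
        true_base = reference[start + offset]
        if rng.random() < phred_to_error_prob(q):
            # Substitution error: pick one of the three other bases.
            read.append(rng.choice([b for b in bases if b != true_base]))
        else:
            read.append(true_base)
    return "".join(read)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are reproducible
    reference = "ACGTACGTACGTACGTACGT"
    quals = [40, 40, 35, 30, 25, 20, 15, 10]  # quality decaying along the read
    print(simulate_read(reference, 2, quals, rng))
```

Passing the random number generator explicitly keeps the simulation reproducible, which matters when synthetic reads are used to benchmark aligners or variant callers against a known ground truth.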