Substantial biases in ultra-short read data sets from high-throughput DNA sequencing

Dohm et al. (2008) characterize the biases in ultra-short read data sets from high-throughput DNA sequencing, using two Illumina 1G data sets: 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mers from the Helicobacter acinonychis genome. Error rates rise from 0.3% at the beginning of reads to 3.8% at the end, and wrong base calls are often preceded by base G. Base substitution errors vary in frequency, with A > C transversions among the most frequent and C > G transversions among the least frequent. Insertions and deletions are rare. In simulated re-sequencing, a 20-fold sequencing coverage was sufficient for correct reads to compensate for the errors. Read coverage is biased, with higher coverage in regions of elevated GC content. Solexa quality scores are also miscalibrated: high scores are over-optimistic, while low scores underestimate the data quality. These findings affect the use and interpretation of Solexa data in applications including de novo sequencing, re-sequencing, SNP identification, DNA methylation site detection, and transcriptome analysis.
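The two measurements at the core of these findings, error rate by read position and observed error rate by quality score, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `solexa_to_error_prob` and `error_profile` and the input format (ungapped read/reference pairs with one integer score per base) are assumptions made for illustration. The score conversion uses the standard Solexa definition, Q = -10 log10(p / (1 - p)).

```python
from collections import defaultdict


def solexa_to_error_prob(q):
    """Error probability predicted by a Solexa quality score.

    Inverts Q = -10 * log10(p / (1 - p)), giving p = 1 / (1 + 10**(Q / 10)).
    """
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


def error_profile(alignments):
    """Tabulate mismatch rates by read position and by quality score.

    `alignments` is a hypothetical input format: an iterable of
    (read, reference, qualities) triples of equal length, aligned
    without gaps (indels are ignored in this sketch).
    """
    pos_counts = defaultdict(lambda: [0, 0])   # position -> [mismatches, bases]
    qual_counts = defaultdict(lambda: [0, 0])  # score -> [mismatches, bases]

    for read, ref, quals in alignments:
        for i, (base, ref_base, q) in enumerate(zip(read, ref, quals)):
            mismatch = int(base != ref_base)
            pos_counts[i][0] += mismatch
            pos_counts[i][1] += 1
            qual_counts[q][0] += mismatch
            qual_counts[q][1] += 1

    # Observed error rate at each read position.
    by_position = {i: m / n for i, (m, n) in sorted(pos_counts.items())}
    # For each score: (observed error rate, error rate the score predicts).
    by_quality = {
        q: (m / n, solexa_to_error_prob(q))
        for q, (m, n) in sorted(qual_counts.items())
    }
    return by_position, by_quality


if __name__ == "__main__":
    toy = [
        ("ACGT", "ACGA", [40, 40, 40, 10]),
        ("ACGT", "ACGT", [40, 40, 40, 10]),
    ]
    by_position, by_quality = error_profile(toy)
    print(by_position)
    print(by_quality)
```

Under these assumptions, a score is over-optimistic when the observed rate in `by_quality` exceeds the predicted one, and it underestimates the data quality when the observed rate is lower.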

2008 | Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina and Heinz Himmelbauer