2008 | Juliane C. Dohm, Claudio Lottaz, Tatiana Borodina and Heinz Himmelbauer
This study investigates biases in ultra-short read data sets produced by high-throughput Solexa (Illumina) DNA sequencing. Two data sets were analyzed: 2.8 million 27mer reads from a Beta vulgaris genomic clone and 12.3 million 36mer reads from the Helicobacter acinonychis genome. Error rates rise along the read, from 0.3% at the beginning to 3.8% at the end, and wrong base calls are frequently preceded by base G. Substitution error frequencies depend on the base change: A > C transversions are among the most frequent and C > G transversions among the least frequent. Single-base insertions and deletions are rare. At 20-fold coverage, correct reads effectively compensate for errors. Read coverage is uneven and biased towards regions with high GC content. High Solexa quality scores are over-optimistic, while low scores underestimate data quality. These biases in error position, error rate, base substitution, and read coverage need to be considered when using and interpreting Solexa data in de novo sequencing, re-sequencing, SNP identification, DNA methylation site detection, and transcriptome analysis.
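The position-dependent error rates and substitution frequencies summarized here can be illustrated with a small tally over aligned reads. The following is a minimal sketch, not the authors' pipeline: the function name `error_profile` is hypothetical, the toy reads are invented for illustration and are not data from the study, and the sketch assumes ungapped read-to-reference alignments (a simplification that ignores the rare insertions and deletions).

```python
from collections import Counter


def error_profile(pairs):
    """Tally mismatches per read position and per substitution type.

    pairs: iterable of (read, reference_segment) strings of equal length,
    assumed to be ungapped alignments (hypothetical input format).
    Returns (rates, substitutions), where rates maps a 0-based read position
    to its mismatch rate and substitutions maps
    (reference base, called base) to a count.
    """
    mismatches = Counter()     # read position -> number of wrong base calls
    totals = Counter()         # read position -> number of bases compared
    substitutions = Counter()  # (reference base, called base) -> count

    for read, ref in pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference segment must have equal length")
        for pos, (called, true) in enumerate(zip(read.upper(), ref.upper())):
            if called == "N" or true == "N":
                continue  # skip ambiguous bases
            totals[pos] += 1
            if called != true:
                mismatches[pos] += 1
                substitutions[(true, called)] += 1

    rates = {pos: mismatches[pos] / totals[pos] for pos in sorted(totals)}
    return rates, substitutions


# Toy example: four 6-base reads against the same reference segment.
example = [
    ("ACGTAC", "ACGTAC"),
    ("ACGTAA", "ACGTAC"),  # C called as A at the last position
    ("ACGTCC", "ACGTAC"),  # A called as C at position 4
    ("ACGTAC", "ACGTAC"),
]
rates, substitutions = error_profile(example)
for pos, rate in rates.items():
    print(f"position {pos}: error rate {rate:.2f}")
print(dict(substitutions))
```

Plotting `rates` against read position for a real data set would show whether the error rate climbs toward the end of the read, and normalizing `substitutions` by reference base counts would allow substitution types such as A > C and C > G to be compared.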