2014 | Jiang, Hongshan; Lei, Rong; Ding, Shou-Wei; et al.
Skewer is a novel adapter trimming tool designed for next-generation sequencing (NGS) paired-end reads. Adapter trimming is essential for accurate data analysis, especially in applications like small RNA sequencing, genome DNA sequencing, and transcriptome RNA/cDNA sequencing.
The authors developed a bit-masked k-difference matching algorithm with expected time complexity \(O(kn)\) and space complexity \(O(m)\), where \(k\) is the maximum number of allowed differences, \(n\) is the read length, and \(m\) is the adapter length. The algorithm efficiently enumerates all candidate matches that satisfy a specified threshold, such as a maximum error ratio. To improve accuracy, a statistical scoring scheme evaluates candidates during pattern matching, and additional scoring schemes exploit paired-end/mate-pair information when it is available. The tool, named Skewer, is implemented in C++ as a Linux program that the authors describe as industry-standard. Experiments on simulated and real data, including small RNA sequencing, paired-end RNA sequencing, and Nextera long mate-pair (LMP) sequencing, showed that Skewer outperforms similar tools in both accuracy and speed. Specifically, Skewer is reported to be "one times faster" (that is, about twice as fast) for single-end sequencing, more than 12 times faster for paired-end sequencing, and 49% faster for LMP sequencing. This combination of speed and accuracy across NGS applications makes Skewer well suited to NGS data preprocessing.
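To make the idea of enumerating candidates under an error-ratio threshold concrete, here is a minimal Python sketch of k-difference matching for 3' adapter trimming. It is not Skewer's bit-masked algorithm: it uses the classic column-wise dynamic programme, which needs \(O(m)\) space but \(O(mn)\) time rather than the expected \(O(kn)\). The function names, the default `max_error_ratio` and `min_overlap` values, the example adapter string, and the rule for picking the best candidate (lowest error ratio, then longest overlap) are all assumptions made for illustration; the paper's statistical scoring scheme is not reproduced.

```python
def adapter_candidates(read, adapter, max_error_ratio=0.1, min_overlap=3):
    """Enumerate (start, differences, adapter_bases) candidates.

    Illustrative sketch only: a plain O(m)-space, O(m*n)-time column DP,
    not the bit-masked algorithm described in the paper.
    """
    m = len(adapter)
    # cost[i]: fewest differences between adapter[:i] and some read
    # substring ending at the current read position.
    # start[i]: index in the read where that substring begins.
    cost = list(range(m + 1))
    start = [0] * (m + 1)
    candidates = []
    for j, base in enumerate(read, start=1):
        diag_cost, diag_start = cost[0], start[0]  # row 0 of previous column
        cost[0], start[0] = 0, j  # an adapter match may begin anywhere
        for i in range(1, m + 1):
            options = (
                # match or substitution
                (diag_cost + (adapter[i - 1] != base), diag_start),
                # extra base in the read (cost[i] still holds the previous column)
                (cost[i] + 1, start[i]),
                # adapter base missing from the read (cost[i-1] is already current)
                (cost[i - 1] + 1, start[i - 1]),
            )
            diag_cost, diag_start = cost[i], start[i]  # save before overwriting
            cost[i], start[i] = min(options)  # ties go to the earliest start
        # The full adapter found inside the read, within the threshold.
        if max_error_ratio * m >= cost[m]:
            candidates.append((start[m], cost[m], m))
    # Adapter prefixes that run off the 3' end of the read.
    for i in range(min_overlap, m):
        if max_error_ratio * i >= cost[i]:
            candidates.append((start[i], cost[i], i))
    return candidates


def trim_adapter(read, adapter, max_error_ratio=0.1, min_overlap=3):
    """Trim at the best candidate; the ranking rule here is an assumption."""
    candidates = adapter_candidates(read, adapter, max_error_ratio, min_overlap)
    if not candidates:
        return read
    # Lowest error ratio first, then the longest adapter overlap.
    best = min(candidates, key=lambda c: (c[1] / c[2], -c[2]))
    return read[:best[0]]
```

For example, `trim_adapter("ACGTACGTAC" + "AGATCGGTAGAG" + "CC", "AGATCGGAAGAG")` returns `"ACGTACGTAC"`: the adapter copy carries one substitution, which is within the 10% error ratio for a 12-base adapter. Carrying a start index alongside each cost is what lets the sketch report where to cut without storing the full alignment matrix.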