PANDAseq: PAired-eND Assembler for Illumina sequences

2012 | Andre P Masella, Andrea K Bartram, Jakub M Truszkowski, Daniel G Brown, Josh D Neufeld
PANDAseq assembles Illumina paired-end reads and corrects errors as it does so. It is fast, scales to billions of reads, and improves sequence yields by using quality information to correct errors in the region where the two reads of a pair overlap.

The software aligns each read pair in three steps: it determines the primer locations, identifies the optimal overlap, and reconstructs the complete sequence. Within the overlap, it calculates the probability that the true nucleotides are the same, given the observed nucleotides and their quality information, and uses this probability to correct errors and construct the output sequence. PANDAseq also handles masked quality scores, applying special quality scoring to masked regions. It validates sequences against user-specified criteria, so users can reject sequences based on quality, length, or the presence of uncalled bases, and a module system supports more sophisticated validation.

PANDAseq was tested on simulated data, single-template PCR amplicons, and experimental data from a defined mixture of genomic DNA fragments. In these tests it improved accuracy and sequence yields relative to naive assembly and existing assemblers, and it outperformed SHERA, iTags, and BIPES in both accuracy and speed. PANDAseq is available under the GNU GPL and is compatible with POSIX-compliant operating systems. It provides a versatile and powerful way to assemble paired-end Illumina reads without discarding high-quality sequence data.
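The probability calculation used in the overlap region can be illustrated with a small Python sketch. This is not PANDAseq's own code. It assumes that quality scores are Phred-scaled, that a miscalled base is equally likely to be any of the other three nucleotides, and that an uncalled base (N) carries no information. The function names are hypothetical, and the rule of keeping the higher-quality base on disagreement is a simplification chosen for illustration.

```python
def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def prob_true_match(b1: str, q1: int, b2: str, q2: int) -> float:
    """Probability that the two true nucleotides are identical,
    given the observed bases and their qualities.

    Assumption: errors are spread uniformly over the 3 other bases.
    """
    if b1 == "N" or b2 == "N":
        # An uncalled base is treated as equally likely to be A, C, G, or T.
        return 0.25
    e1, e2 = phred_to_error(q1), phred_to_error(q2)
    if b1 == b2:
        # Both calls correct, or both wrong and wrong towards the same base.
        return (1 - e1) * (1 - e2) + e1 * e2 / 3
    # Exactly one call is wrong and the truth is the other read's base,
    # or both are wrong and share one of the 2 remaining bases.
    return (1 - e1) * e2 / 3 + (1 - e2) * e1 / 3 + 2 * e1 * e2 / 9


def consensus_base(b1: str, q1: int, b2: str, q2: int) -> str:
    """Pick one base for an overlap position (illustrative rule only)."""
    if b1 == "N":
        return b2
    if b2 == "N":
        return b1
    return b1 if q1 >= q2 else b2


def merge_overlap(fwd, fwd_q, rev, rev_q):
    """Merge two already-aligned overlap segments of equal length.

    Returns the consensus string and the per-position match probabilities.
    """
    bases, probs = [], []
    for b1, q1, b2, q2 in zip(fwd, fwd_q, rev, rev_q):
        bases.append(consensus_base(b1, q1, b2, q2))
        probs.append(prob_true_match(b1, q1, b2, q2))
    return "".join(bases), probs


if __name__ == "__main__":
    seq, probs = merge_overlap("ACGT", [30, 30, 30, 30],
                               "ACCT", [30, 30, 10, 30])
    print(seq)
    print([round(p, 4) for p in probs])
```

In this sketch the third position disagrees (G at quality 30 against C at quality 10), so the higher-quality G is kept and the match probability at that position is low. A candidate overlap could be scored by combining these per-position probabilities, although how PANDAseq itself scores overlaps is not specified in this summary.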