2012 | Andre P Masella, Andrea K Bartram, Jakub M Truszkowski, Daniel G Brown, Josh D Neufeld
PANDAseq is a paired-end assembler designed for Illumina sequencing data, specifically for analyzing microbial communities by targeting amplicons of the 16S rRNA gene. The software rapidly assembles paired-end reads while correcting most errors, particularly those involving uncalled or miscalled bases, using quality information from the Illumina reads. Benchmark tests using simulated data, a pure source template, and a pooled template of genomic DNA from known organisms showed that PANDAseq assembles reads more efficiently and with higher accuracy compared to alternative methods.
The software scales to billions of paired-end reads and assembles 4-50% more sequences than naïve assembly while maintaining high-quality sequence yields. PANDAseq's implementation involves a three-step process: determining primer locations, identifying the optimal overlap, and reconstructing the complete sequence. The software uses probabilistic error correction and checks constraints such as sequence length and quality. Experimental validation using simulated data, a single-template PCR amplicon, and a defined mixture of genomic DNA fragments demonstrated PANDAseq's effectiveness in improving both the quality and the quantity of assembled sequences.
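
The overlap and reconstruction steps described above can be illustrated with a minimal Python sketch. This is not PANDAseq's actual code or scoring function: the names `overlap_score` and `assemble`, the log-odds score against a 0.25 chance-match baseline, the `min_overlap` default, and the rule of keeping the higher-quality base at a disagreement are all assumptions made for illustration. Primer detection is skipped, and the sketch only considers the case where the end of the forward read overlaps the start of the reverse-complemented reverse read.

```python
import math

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def overlap_score(fwd, fq, rev, rq, overlap):
    """Log-odds score (illustrative, not PANDAseq's formula) for aligning
    the last `overlap` bases of fwd with the first `overlap` bases of rev."""
    offset = len(fwd) - overlap
    score = 0.0
    for i in range(overlap):
        a, b = fwd[offset + i], rev[i]
        if a == "N" or b == "N":
            continue  # an uncalled base carries no evidence either way
        pa, pb = error_prob(fq[offset + i]), error_prob(rq[i])
        # Probability that the two calls come from the same true base.
        p_same = (1 - pa) * (1 - pb) + pa * pb / 3
        if a == b:
            score += math.log(p_same / 0.25)
        else:
            score += math.log((1 - p_same) / 0.75)
    return score


def assemble(fwd, fq, rev, rq, min_overlap=5):
    """Merge a read pair; return None if no overlap scores above chance."""
    rev_rc, rq_rc = reverse_complement(rev), rq[::-1]
    best_score, best_overlap = 0.0, None
    for overlap in range(min_overlap, min(len(fwd), len(rev_rc)) + 1):
        s = overlap_score(fwd, fq, rev_rc, rq_rc, overlap)
        if s > best_score:
            best_score, best_overlap = s, overlap
    if best_overlap is None:
        return None
    offset = len(fwd) - best_overlap
    merged = list(fwd[:offset])
    for i in range(best_overlap):
        a, b = fwd[offset + i], rev_rc[i]
        if a == "N":
            merged.append(b)  # fill an uncalled base from the other read
        elif b == "N":
            merged.append(a)
        else:
            # On disagreement, keep the call with the higher quality.
            merged.append(a if fq[offset + i] >= rq_rc[i] else b)
    merged.extend(rev_rc[best_overlap:])
    return "".join(merged)


template = "ACGTTGCAAGGCTTACGATCGGATCCATGA"
fwd = template[:20]
rev = reverse_complement(template[10:])
print(assemble(fwd, [30] * 20, rev, [30] * 20) == template)
```

In this toy pair the two 20-base reads share a 10-base overlap. A low-quality miscall or an `N` inside that overlap in one read is replaced by the other read's call, which is the general idea behind quality-aware correction during assembly.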