ALLPATHS: De novo assembly of whole-genome shotgun microreads

2008 | Jonathan Butler, Iain MacCallum, Michael Kleber, Ilya A. Shlyakhter, Matthew K. Belmonte, Eric S. Lander, Chad Nusbaum, David B. Jaffe
ALLPATHS is a de novo assembly method for whole-genome shotgun microreads, the short reads generated by new sequencing technologies. It rests on a theoretical analysis and an algorithm that can be applied to various types of DNA sequence data, including both short reads and conventional sequence reads. Assemblies are represented as graphs, which retain intrinsic ambiguities such as those arising from polymorphism. This allows for more accurate and detailed assemblies than previous methods.

The algorithm involves two key concepts: finding all paths across a given read pair, and localization. Localization isolates small regions of the genome so that each can be assembled independently. The algorithm is designed to handle the computational challenges of microread assembly, including the high number of overlaps and the potential for false overlaps.

On simulated reads at 80× coverage, the method generates high-quality assemblies for small to mid-sized genomes (up to 39 Mb), with high completeness, continuity, and accuracy. The bacterial genomes Campylobacter jejuni and Escherichia coli assemble optimally, each yielding a single perfect contig, while larger genomes produce highly connected and accurate assemblies. The C. jejuni and E. coli assemblies have no errors, and the other assemblies contain only a few. The method also performs well on real data generated by Solexa sequencing, producing a high-quality assembly with 99.1% coverage and only 12 discrepancies with the reference sequence.

The ALLPATHS approach offers the potential to accurately capture polymorphism and systemic ambiguity in assemblies. It is applicable to a wide range of DNA sequence data and provides a generalized representation suitable for various types of sequencing data.
The results demonstrate that high-quality assemblies can be achieved using microreads, even in the presence of sequencing errors and repetitive sequences.
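
To make the idea of "finding all paths across a given read pair" more concrete, here is a minimal Python sketch. It is not the ALLPATHS implementation: the k-mer graph construction, the function names (build_kmer_graph, all_paths), the toy sequences, and the length bounds are all assumptions chosen for illustration. It builds a simple k-mer graph from short reads and enumerates every sequence that connects the two ends of a pair within an allowed length range. The toy input contains one polymorphic site, so two paths are returned, which mirrors how a graph representation can retain ambiguity.

```python
from collections import defaultdict


def build_kmer_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def all_paths(graph, start, end, min_len, max_len):
    """Return every sequence that begins with `start`, ends with `end`,
    follows graph edges, and has total length in [min_len, max_len]."""
    results = []
    stack = [(start, start)]  # (current node, sequence spelled so far)
    while stack:
        node, seq = stack.pop()
        if node == end and min_len <= len(seq) <= max_len:
            results.append(seq)
        if len(seq) >= max_len:
            continue  # the length bound guarantees termination, even with cycles
        for nxt in graph.get(node, ()):
            stack.append((nxt, seq + nxt[-1]))
    return sorted(results)


# Hypothetical toy data: two haplotypes that differ at a single base.
haplotypes = ["AACGTCTGGA", "AACGACTGGA"]
# Simulated error-free "microreads" of length 6 covering both haplotypes.
reads = [h[i:i + 6] for h in haplotypes for i in range(len(h) - 5)]

graph = build_kmer_graph(reads, k=4)
# Treat "AAC" and "GGA" as the anchors given by the two ends of a read pair,
# and allow closures between 8 and 12 bases long.
paths = all_paths(graph, "AAC", "GGA", min_len=8, max_len=12)
print(paths)  # -> ['AACGACTGGA', 'AACGTCTGGA']
```

Exhaustive enumeration like this can grow quickly in repetitive regions, which is one reason restricting the search to a small, isolated region (the role the summary assigns to localization) is attractive.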