Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, Adam M. Phillippy
Canu is a scalable and accurate long-read assembler that improves upon and replaces the now unsupported Celera Assembler. It is specifically designed for noisy single-molecule sequences and supports both PacBio and Oxford Nanopore sequencing. Canu halves the depth-of-coverage requirements and improves assembly continuity while reducing runtime by an order of magnitude on large genomes compared to Celera Assembler 8.2. These improvements come from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either PacBio or Oxford Nanopore technologies, achieving a contig NG50 of greater than 21 Mbp on both human and Drosophila melanogaster PacBio datasets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
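For readers unfamiliar with the continuity metric quoted above, NG50 is commonly defined as the length of the shortest contig in the set of longest contigs that together cover at least half of the estimated genome size. The following is a minimal sketch of that common definition, not code from Canu; the function name `ng50` and its arguments are illustrative assumptions.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly (illustrative helper, not part of Canu).

    Contigs are visited from longest to shortest; the NG50 is the length
    of the contig at which the running total first reaches half of the
    estimated genome size. Returns 0 if the assembly never reaches it.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0
```

Unlike N50, which normalizes by the total assembly length, NG50 normalizes by the genome size, so values from assemblies of different total size remain comparable.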
Canu introduces several novel features including computational resource discovery, adaptive k-mer weighting, automated error rate estimation, sparse graph construction, and graphical fragment assembly (GFA) outputs. The Canu pipeline consists of three stages—correction, trimming, and assembly—each of which can run independently or in series. When running in a parallel environment, Canu auto-detects available resources and configures itself to maximize resource utilization. It is currently the most efficient single-molecule read assembler available for large genomes, requiring approximately 20,000 CPU hours to assemble a human genome, compared to 60,000 required for Falcon and >250,000 required for Celera Assembler v8.2. In addition to these runtime improvements, the resulting assemblies are significantly more continuous than prior versions.
Canu uses a variant of the greedy “best overlap graph” (BOG) algorithm for constructing a sparse overlap graph. Loading the full overlap graph into memory can be costly for large, complex genomes. In contrast, the greedy algorithm loads only the “best” (longest) overlaps for each read end into memory. This greedy approach is optimal when the read length is sufficiently long, and a best overlap graph can be built using just 64 GB of memory for a mammalian genome. However, the greedy algorithm can be misled by repeats that are longer than the overlap length and is therefore prone to mis-assemblies. Canu's new “Bogart” algorithm addresses this problem by statistically filtering repeat-induced overlaps and retrospectively inspecting the graph for potential errors.
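The greedy selection step described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Bogart implementation: the `Overlap` record, its field names, and the `best_overlaps` function are hypothetical, "best" is taken to mean longest as in the text, and the statistical repeat filtering and retrospective graph inspection are omitted.

```python
from collections import namedtuple

# Hypothetical overlap record: an overlap joins one end of read_a to one
# end of read_b and has a length in bases. Field names are assumptions.
Overlap = namedtuple("Overlap", ["read_a", "end_a", "read_b", "end_b", "length"])


def best_overlaps(overlaps):
    """Keep only the longest overlap for each read end.

    Returns a dict mapping (read, end) to ((partner_read, partner_end), length).
    Only one entry per read end is stored, which is why the sparse graph
    needs far less memory than the full overlap graph.
    """
    best = {}
    for ov in overlaps:
        a = (ov.read_a, ov.end_a)
        b = (ov.read_b, ov.end_b)
        # Each overlap is a candidate for both of the read ends it touches.
        for key, partner in ((a, b), (b, a)):
            current = best.get(key)
            if current is None or ov.length > current[1]:
                best[key] = (partner, ov.length)
    return best
```

In this sketch a repeat longer than the overlap length could still win the "longest" comparison for a read end, which is the failure mode the text attributes to the plain greedy approach.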
Canu's adaptive tf-idf weighting scheme requires no parameter adjustment and achieves 89% sensitivity and maintains high PPV with no added runtime or