Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, Adam M. Phillippy
Canu is a scalable and accurate long-read assembler designed for noisy single-molecule sequencing data, including reads from nanopore sequencers. It addresses the challenges of assembling large repeats and closely related haplotypes with two techniques: an adaptive overlapping strategy based on tf-idf weighted MinHash, and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. Compared to Celera Assembler 8.2, Canu halves depth-of-coverage requirements, improves assembly continuity, and reduces runtime by an order of magnitude. It can assemble complete microbial genomes and near-complete eukaryotic chromosomes from PacBio or Oxford Nanopore reads, achieving a contig NG50 of over 21 Mbp on human and *Drosophila melanogaster* PacBio datasets. For complex genomes, Canu also provides graph-based assembly outputs in GFA format, combining highly resolved assembly graphs with long-range scaffolding information. The tool is modular: the correction, trimming, and assembly stages can be run independently or in sequence, and it supports distributed computing for large genomes.
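
To make the idea of tf-idf weighted MinHash more concrete, the following is a minimal sketch, not Canu's actual implementation. It assumes one simple way to apply weights: each k-mer is hashed several times, with the number of copies growing with its tf-idf weight, so that rare, informative k-mers are more likely to supply the minimum hash and repetitive k-mers are less likely to. All function names, the parameters `num_hashes` and `max_copies`, and the weighting formula are illustrative assumptions.

```python
import hashlib
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def idf_table(reads, k):
    """Inverse document frequency: k-mers found in many reads score low."""
    df = Counter()
    for read in reads:
        df.update(set(kmers(read, k)))
    n = len(reads)
    return {km: math.log(1.0 + n / count) for km, count in df.items()}


def _hash(kmer, seed, copy):
    """Deterministic 64-bit hash of (seed, copy, kmer)."""
    data = f"{seed}:{copy}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_sketch(read, k, idf, num_hashes=64, max_copies=4):
    """MinHash sketch in which each k-mer is hashed `copies` times.

    A k-mer with a larger tf-idf weight gets more copies (an assumed,
    simplified weighting), so it is more likely to supply the minimum
    hash value for a given seed.
    """
    tf = Counter(kmers(read, k))
    if not tf:
        return []  # read shorter than k
    max_idf = max(idf.values(), default=1.0)
    copies = {}
    for km, count in tf.items():
        weight = count * idf.get(km, max_idf) / max_idf
        copies[km] = max(1, min(max_copies, round(weight * max_copies)))
    return [
        min(_hash(km, seed, c) for km, n in copies.items() for c in range(n))
        for seed in range(num_hashes)
    ]


def similarity(sketch_a, sketch_b):
    """Fraction of sketch positions where both reads share the same minimum."""
    if not sketch_a or not sketch_b:
        return 0.0
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    reads = ["ACGTTGCAGGCTA", "ACGTTGCAGGCTT", "AAAAAAAA", "CCCCCCCC"]
    idf = idf_table(reads, k=4)
    sketches = [weighted_sketch(r, 4, idf) for r in reads]
    print("similar pair:  ", similarity(sketches[0], sketches[1]))
    print("unrelated pair:", similarity(sketches[2], sketches[3]))
```

Pairs of reads with a high estimated similarity would then be candidates for a full overlap alignment. Down-weighting k-mers that occur in many reads is what makes the overlapping step less sensitive to repeats.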