Understanding Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Canu is a scalable and accurate long-read assembly tool designed for noisy single-molecule sequencing data, including reads from nanopore sequencers. It addresses the challenges of assembling large repeats and closely related haplotypes with two techniques: an adaptive overlapping strategy based on tf-idf weighted MinHash, and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. Compared to Celera Assembler 8.2, Canu halves depth-of-coverage requirements, improves assembly continuity, and reduces runtime by an order of magnitude. It can assemble complete microbial genomes and near-complete eukaryotic chromosomes from PacBio or Oxford Nanopore data, achieving a contig NG50 of over 21 Mbp on human and *Drosophila melanogaster* PacBio datasets. For complex genomes, Canu also provides graph-based assembly outputs in GFA format, combining highly resolved assembly graphs with long-range scaffolding information. The tool is modular: the correction, trimming, and assembly stages can be run independently or in sequence, and distributed computing is supported for large genomes.
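To make the idea of tf-idf weighted MinHash more concrete, the following Python sketch shows one way such a weighting could work. It is an illustrative simplification under stated assumptions, not Canu's actual implementation: the function names (`idf_weights`, `weighted_sketch`, `similarity`), the `max_copies` parameter, and the choice to express a k-mer's weight by hashing it a variable number of times are all assumptions made for this example. The intent is only to show that k-mers shared by many reads (likely repeats) become less likely to be chosen as sketch minima, while rarer k-mers become more likely.

```python
import hashlib
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def idf_weights(reads, k):
    """Inverse document frequency: k-mers seen in many reads get a low weight."""
    doc_freq = Counter()
    for read in reads:
        doc_freq.update(set(kmers(read, k)))
    n = len(reads)
    return {km: math.log(1 + n / df) for km, df in doc_freq.items()}


def _hash(kmer, slot, copy):
    """Deterministic 64-bit hash of a (slot, copy, k-mer) triple."""
    data = f"{slot}:{copy}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_sketch(read, k, idf, num_slots=64, max_copies=4):
    """Hypothetical weighted MinHash: heavier k-mers are hashed more times,
    so they are more likely to supply the minimum in each sketch slot."""
    tf = Counter(kmers(read, k))
    if not tf:
        return []
    max_idf = max(idf.values())
    copies = {}
    for km, count in tf.items():
        # Unseen k-mers are treated as rare (assumption for this sketch).
        weight = count * idf.get(km, max_idf) / max_idf
        copies[km] = min(max_copies, max(1, round(max_copies * weight)))
    sketch = []
    for slot in range(num_slots):
        sketch.append(min(_hash(km, slot, c)
                          for km, n_copies in copies.items()
                          for c in range(n_copies)))
    return sketch


def similarity(sketch_a, sketch_b):
    """Fraction of sketch slots whose minimum values agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    reads = ["ACGTAAAA", "ACGTCCCC", "GGGGTTTT"]
    idf = idf_weights(reads, k=4)
    sketches = [weighted_sketch(r, 4, idf, num_slots=16) for r in reads]
    print(similarity(sketches[0], sketches[1]))
```

In this toy model, read pairs whose sketches agree in many slots would be candidates for a more careful overlap computation. A production overlapper would use far longer reads, larger k, and a much faster hash function.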

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Sergey Koren, Brian P. Walenz, Konstantin Berlin, Jason R. Miller, Nicholas H. Bergman, Adam M. Phillippy