[slides and audio] Assembling large genomes with single-molecule sequencing and locality-sensitive hashing

The authors present a method called MinHash Alignment Process (MHAP) for efficiently overlapping noisy, long reads from single-molecule sequencing data using probabilistic, locality-sensitive hashing. MHAP, combined with Celera Assembler, was used to reconstruct the genomes of *Escherichia coli*, *Saccharomyces cerevisiae*, *Arabidopsis thaliana*, *Drosophila melanogaster*, and the human genome from high-coverage single-molecule real-time (SMRT) sequencing data. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For *D. melanogaster*, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost. The authors also validate the assemblies by comparing them to the closest available reference genome and analyze the completeness of the *D. melanogaster* chromosomes and gene sequences. They further demonstrate the ability of long-read sequencing to better reconstruct the repeat-rich heterochromatic regions of eukaryotic chromosomes. Finally, they discuss the potential of probabilistic alignment and long-read sequencing for addressing the ever-expanding scale of genomic data.
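
To make the idea of overlapping reads with locality-sensitive hashing more concrete, here is a minimal Python sketch of generic MinHash: each read is reduced to a fixed-size sketch of minimum k-mer hash values, and the fraction of matching sketch positions estimates the Jaccard similarity of the two reads' k-mer sets. This is an illustration only, not MHAP's actual implementation; the function names, the k-mer size `k=8`, the sketch size of 256, and the use of salted BLAKE2b as the hash family are all assumptions chosen for readability.

```python
import hashlib
import random


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """64-bit hash of a k-mer; the seed selects one function from the family.

    Salted BLAKE2b is an illustrative choice, not what MHAP uses.
    """
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(kmer.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=256):
    """Sketch a sequence as the minimum hash value under each hash function."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    rng = random.Random(0)
    genome = "".join(rng.choice("ACGT") for _ in range(600))
    read_a = genome[0:400]    # hypothetical error-free read
    read_b = genome[200:600]  # overlaps read_a by 200 bases
    estimate = estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b))
    print("estimated:", estimate, "exact:", exact_jaccard(read_a, read_b))
```

Comparing two sketches costs time proportional to the sketch size rather than to the read length, which is the general property that makes this family of techniques attractive for all-versus-all overlap detection. Real single-molecule reads contain errors, which lower the k-mer similarity between truly overlapping reads; this sketch uses error-free reads and does not model that.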

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

August 14, 2014 | Konstantin Berlin¹,², Sergey Koren³,⁴,*, Chen-Shan Chin⁴, James Drake⁴, Jane M. Landolin⁴, and Adam M. Phillippy³