Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing

August 14, 2014 | Konstantin Berlin¹,², Sergey Koren³,⁴,*, Chen-Shan Chin⁴, James Drake⁴, Jane M. Landolin⁴, and Adam M. Phillippy³
This preprint describes a method for assembling large genomes using single-molecule sequencing and locality-sensitive hashing. The authors introduce the MinHash Alignment Process (MHAP), a probabilistic algorithm that efficiently aligns noisy, long reads using MinHash sketches. MHAP is combined with Celera Assembler to reconstruct the genomes of Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and human from high-coverage SMRT sequencing. The resulting assemblies include fully resolved chromosome arms and close persistent gaps in these important reference genomes, including heterochromatic and telomeric transition sequences. For D. melanogaster, MHAP achieved a 600-fold speedup relative to prior methods and a cloud computing cost of a few hundred dollars. These results demonstrate that single-molecule sequencing alone can produce near-complete eukaryotic genomes at modest cost.

The authors integrated MHAP into the Celera Assembler PBcR hierarchical assembly pipeline and assembled the genomes of Escherichia coli K12, Saccharomyces cerevisiae W303, Drosophila melanogaster ISO1, Arabidopsis thaliana Ler-0, and the haploid human cell line CHM1htert. They report that the resulting assemblies are superior to any previous de novo assemblies of these organisms and were produced in a fraction of the time required by previous overlapping algorithms. The assemblies include novel heterochromatic sequences and fill persistent gaps remaining in the reference genomes of these organisms.

The authors also present a de novo human assembly built from long reads. They assembled 54X SMRT sequencing reads from CHM1htert and compared the assembly to the human GRCh38 reference. The contig N50 of the long-read assembly is an order of magnitude larger than that of both the CHM1 Illumina assembly and the original Sanger-based human assemblies. Based on the comparison to GRCh38, the assembly averages 92 contigs per chromosome and potentially resolves 51 of 819 (6%) annotated gaps. Of these putative gap closures, 16 match the estimated gap size in the reference, but all require further validation to confirm that they represent true gap closures.

To validate the assemblies, the authors compared each one to the closest available reference genome using dnadiff. All assemblies are structurally concordant with the reference sequences. An analysis of the completeness of the D. melanogaster chromosomes and gene sequences finds that the assembly covers 122.9 Mbp (95.7%) of the 129.7 Mbp version 5 reference genome.
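
The MinHash sketches mentioned in the summary can be illustrated with a small Python sketch. This is not the MHAP implementation: the function names, the k-mer size of 8, the sketch size of 128, and the use of seeded BLAKE2b as the hash family are all assumptions chosen for illustration. The idea it shows is the general MinHash one: each read is reduced to a short list of minimum hash values over its k-mers, and the fraction of matching positions between two sketches estimates the Jaccard similarity of their k-mer sets without comparing the reads base by base.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def sketch(seq, k=8, num_hashes=128):
    """MinHash sketch: for each hash function, keep the minimum hash over all k-mers.

    k and num_hashes are illustrative defaults, not values taken from the preprint.
    """
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(hash_kmer(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; an estimate of k-mer set Jaccard similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard similarity of the two k-mer sets, for comparison with the estimate."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two made-up reads that share a long overlapping region.
    read_a = "ACGTTGCATGTCGCATGATGCATGAGAGCTACGTAGCTAGCTAGGCTAACGT"
    read_b = "GCATGATGCATGAGAGCTACGTAGCTAGCTAGGCTAACGTTTGACCATGCA"
    print("estimated:", estimate_jaccard(sketch(read_a), sketch(read_b)))
    print("exact:    ", exact_jaccard(read_a, read_b))
```

Because each sketch has a fixed size regardless of read length, comparing sketches is much cheaper than aligning the reads directly, which is the general reason a sketch-based approach can speed up overlap detection. A larger sketch narrows the gap between the estimate and the exact value at the cost of more hashing.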