Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing

June 3, 2016 | Chen-Shan Chin, Paul Peluso, Fritz J. Sedlazeck, Maria Nattestad, Gregory T. Concepcion, Alicia Clum, Christopher Dunn, Ronan O’Malley, Rosa Figueroa-Balderas, Abraham Morales-Cruz, Grant R. Cramer, Massimo Delledonne, Chongyuan Luo, Joseph R. Ecker, Dario Cantu, David R. Rank, Michael C. Schatz
This paper introduces FALCON and FALCON-Unzip, open-source algorithms for assembling Single Molecule Real-Time (SMRT) Sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. The methods target heterozygous or rearranged genomes, which are difficult for short-read assembly approaches. The algorithms were tested on three heterozygous samples: an F1 hybrid of Arabidopsis thaliana, the grapevine cultivar Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata. The FALCON-based assemblies were more contiguous and complete than alternative short- or long-read approaches. The phased diploid assemblies made it possible to study haplotype structure and heterozygosity between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
The FALCON assembler follows the design of the Hierarchical Genome Assembly Process (HGAP), but with optimized components for each step. It first error-corrects raw PacBio reads using long-read-to-long-read alignments, then constructs a string graph of the overlapping reads. The result is a set of "haplotype-fused" contigs plus variant sequence "bubbles" that represent major structural variations and divergent regions between homologous sequences.
The associated tool, FALCON-Unzip, analyzes these contigs to find heterozygous variants such as SNPs. It uses the resulting phasing information to group reads into phasing blocks and haplotypes, then reassembles each group into haplotigs, which are integrated with the initial "haplotype-fused" contigs to produce the diploid assembly. FALCON-Unzip was applied to the genomes of Arabidopsis thaliana inbred lines, the F1 hybrid, the grapevine, and the coral fungus.
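The read-grouping idea behind phasing can be illustrated with a toy example. The sketch below is not FALCON-Unzip's actual algorithm: it is a minimal greedy heuristic, assuming each read has already been reduced to the alleles (coded 0 or 1) it carries at heterozygous SNP positions. The function name `phase_reads`, the read names, and the SNP positions are all hypothetical.

```python
def phase_reads(reads):
    """Greedily split reads into two haplotype groups.

    reads: dict of read name -> {snp_position: allele}, alleles coded 0/1.
    Returns a dict of read name -> haplotype label (0 or 1).
    """
    # hap0[pos] is the allele assigned to haplotype 0 at pos; haplotype 1
    # is assumed to carry the other allele (1 - hap0[pos]).
    hap0 = {}
    assignment = {}
    # Visit reads left to right by their first SNP position.
    for name in sorted(reads, key=lambda r: min(reads[r], default=0)):
        alleles = reads[name]
        # +1 for each shared SNP that matches haplotype 0, -1 otherwise.
        score = sum(1 if hap0[p] == a else -1
                    for p, a in alleles.items() if p in hap0)
        # A read sharing no SNP with earlier reads scores 0 and lands in
        # group 0 arbitrarily; a real phaser would start a new phasing
        # block at that point instead.
        label = 0 if score >= 0 else 1
        assignment[name] = label
        # Record SNPs seen for the first time, oriented to haplotype 0.
        for p, a in alleles.items():
            hap0.setdefault(p, a if label == 0 else 1 - a)
    return assignment


# Hypothetical reads covering four heterozygous SNPs.
example_reads = {
    "r1": {100: 0, 200: 1},
    "r2": {100: 1, 200: 0},
    "r3": {200: 1, 300: 0},
    "r4": {200: 0, 300: 1},
    "r5": {300: 0, 400: 1},
}
groups = phase_reads(example_reads)
print(groups)  # {'r1': 0, 'r2': 1, 'r3': 0, 'r4': 1, 'r5': 0}
```

Reads r1, r3, and r5 chain together through shared SNPs and form one haplotype group, while r2 and r4 form the other. In the workflow described above, each such group would then be reassembled separately into a haplotig.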
The assemblies were evaluated for accuracy and completeness using various metrics, including nucleotide sequence accuracy, BUSCO analysis, and structural variation identification. The FALCON-Unzip assemblies were found to be more contiguous and accurate than alternative methods, with high-quality haplotype-specific sequence information. The results demonstrated that FALCON-Unzip can effectively resolve haplotypes and structural variations in heterozygous genomes, providing a more accurate representation of the genome than previous methods. The study highlights the potential of FALCON-Unzip for assembling diploid genomes, particularly in species with high heterozygosity or complex genomic structures.
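The summary describes assemblies as "more contiguous" without naming a metric. One widely used measure of contiguity is contig N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The sketch below shows how it is computed. The contig lengths are invented for illustration and are not results from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Two made-up assemblies with the same total length (290 units).
contiguous = [80, 70, 50, 40, 30, 20]
fragmented = [40, 40, 40, 35, 35, 30, 30, 20, 20]

print(n50(contiguous))  # 70
print(n50(fragmented))  # 35
```

Both toy assemblies contain the same amount of sequence, but the first packs it into fewer, longer contigs and so has the higher N50.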