HapSolo: an optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding

2021 | Edwin A. Solares, Yuan Tao, Anthony D. Long and Brandon S. Gaut
HapSolo is an optimization method for removing secondary haplotigs during diploid genome assembly and scaffolding. The method identifies secondary contigs and defines a primary assembly based on multiple pairwise contig alignment metrics. HapSolo evaluates candidate primary assemblies using BUSCO scores and then distinguishes among them using a cost function. The cost function can be defined by the user, but by default it considers the number of missing, duplicated and single-copy BUSCO genes within the assembly. HapSolo performs hill climbing to minimize this cost over thousands of candidate assemblies.
The method was tested on genome data from three species: the Chardonnay grape (Vitis vinifera), a mosquito (Anopheles funestus) and the thorny skate (Amblyraja radiata). HapSolo rapidly identified candidate assemblies that yielded improvements in assembly metrics, including decreased genome size and improved N50 scores. Contig N50 scores improved by 35%, 9% and 9% for Chardonnay, mosquito and the thorny skate, respectively, relative to unreduced primary assemblies. The benefits of HapSolo were amplified in downstream analyses, as illustrated by scaffolding with Hi-C data. In a comparison with PurgeDups, HapSolo gave generally superior results. HapSolo is implemented in Python and is freely available. The method is effective for reducing genome size and improving assembly quality, particularly in highly heterozygous genomes. It is applicable to any contig assembly from any assembler and any sequencing type, and it can be adapted to different datasets and parameters. It is computationally efficient and can be run on a laptop or desktop computer.
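
The search described above (score candidate assemblies with a BUSCO-based cost, then hill climb to lower that cost) can be illustrated with a minimal Python sketch. This is not HapSolo's actual code. The cost formula, the `BuscoCounts` container, the `hill_climb` routine and the synthetic `toy_evaluate` function are all assumptions made for illustration. In a real run, the evaluation step would filter contigs by alignment-metric thresholds and tally BUSCO results for the remaining primary assembly.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BuscoCounts:
    """BUSCO tallies for one candidate primary assembly (hypothetical container)."""
    single: int
    duplicated: int
    missing: int


def cost(counts: BuscoCounts) -> float:
    # Assumed cost: penalise missing and duplicated genes, reward single-copy genes.
    # The source only says the default cost considers these three counts.
    return (counts.missing + counts.duplicated) / max(counts.single, 1)


def toy_evaluate(thresholds):
    # Synthetic stand-in for "purge contigs, then run BUSCO" (not real data).
    # Lenient thresholds leave duplicates; strict thresholds lose genes.
    strictness = sum(thresholds) / len(thresholds)
    duplicated = round(300 * (1 - strictness) ** 2)
    missing = round(300 * strictness ** 2)
    return BuscoCounts(single=1000 - duplicated - missing,
                       duplicated=duplicated, missing=missing)


def hill_climb(evaluate, start, step=0.05, iterations=1000, seed=0):
    """Randomly perturb the thresholds and keep a move only if it lowers the cost."""
    rng = random.Random(seed)
    best = tuple(start)
    best_cost = cost(evaluate(best))
    for _ in range(iterations):
        # Propose a nearby threshold set, clamped to the range [0, 1].
        candidate = tuple(
            min(1.0, max(0.0, t + rng.uniform(-step, step))) for t in best
        )
        candidate_cost = cost(evaluate(candidate))
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
    return best, best_cost


if __name__ == "__main__":
    thresholds, final_cost = hill_climb(toy_evaluate, start=(0.1, 0.1))
    print(thresholds, final_cost)
```

Because a move is accepted only when it strictly lowers the cost, the returned cost can never exceed the starting cost. A user-defined cost, as the summary mentions, would amount to swapping out the `cost` function in this sketch.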
HapSolo is a valuable tool for improving genome assembly and scaffolding, particularly for diploid genomes.
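
For readers unfamiliar with the contig N50 metric cited in the results, the short Python sketch below computes it using the standard definition: the length of the contig at which the running total of contig lengths, sorted longest first, reaches half the assembly size. The contig lengths are invented toy values, not data from the paper, and the function name `n50` is an illustrative choice.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: an unreduced assembly in which several contigs have a redundant copy.
unreduced = [100, 60, 50, 50, 40, 40, 30, 30]
# The same assembly after dropping one copy of each redundant contig.
reduced = [100, 60, 50, 40, 30]

print(sum(unreduced), n50(unreduced))  # 400 50
print(sum(reduced), n50(reduced))      # 280 60
```

In this toy case, removing redundant contigs lowers the total assembly size and raises N50, which is the direction of change the summary reports for the three test genomes.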