23 January 2020 | Dengfeng Guan, Shane A. McCarthy, Jonathan Wood, Kerstin Howe, Yadong Wang, Richard Durbin
The paper introduces *purge_dups*, a tool that identifies and removes haplotypic duplication and heterozygous overlaps in primary genome assemblies. Amid the rapid development of long-read sequencing and scaffolding technologies, assemblies can end up containing multiple copies of the same region where heterozygosity is high, which breaks contiguity and interferes with downstream steps such as gene annotation. Existing tools either remove only contained duplicate regions (haplotigs) or fail to use all relevant information, which leads to errors.
*purge_dups* uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. The tool is evaluated on four primary assemblies (*Arabidopsis thaliana*, *Anopheles coluzzii*, *Vitis vinifera*, and *Myripristis murdjan*) and compared with the existing tools *purge_haplotigs* and *HaploMerger2*. The results show that *purge_dups* reduces heterozygous duplication more effectively, increases assembly contiguity, and maintains the completeness of the primary assembly. Additionally, *purge_dups* is fully automatic and can be easily integrated into assembly pipelines.
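To make the idea of combining read depth with sequence similarity concrete, the following is a minimal illustrative sketch in Python. It is not the *purge_dups* implementation: the `Contig` fields, the `classify` function, and the `min_match` and `tolerance` thresholds are all hypothetical and chosen only for illustration. The sketch assumes that a contig representing a single haplotype tends to have roughly half the read depth of a collapsed diploid region, and it flags a contig only when that depth signal coincides with a high-similarity match to another contig.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    mean_depth: float       # mean read depth across the contig
    best_match_frac: float  # fraction of the contig aligned to another contig (0..1)


def classify(contig: Contig, diploid_depth: float,
             min_match: float = 0.8, tolerance: float = 0.25) -> str:
    """Label a contig as 'candidate_haplotig' or 'keep'.

    Hypothetical rule: flag the contig only if its depth is close to half
    the diploid depth AND most of it aligns to another contig.
    """
    half_depth = diploid_depth / 2
    # Depth signal: within a window around half the diploid depth.
    is_half_depth = abs(contig.mean_depth - half_depth) <= tolerance * diploid_depth
    # Similarity signal: most of the contig is covered by a match elsewhere.
    is_duplicate = contig.best_match_frac >= min_match
    return "candidate_haplotig" if (is_half_depth and is_duplicate) else "keep"


# Toy data, invented for illustration only.
contigs = [
    Contig("ctg_a", mean_depth=58.0, best_match_frac=0.10),  # diploid depth, unique
    Contig("ctg_b", mean_depth=31.0, best_match_frac=0.95),  # half depth, duplicated
    Contig("ctg_c", mean_depth=29.0, best_match_frac=0.20),  # half depth, unique
    Contig("ctg_d", mean_depth=62.0, best_match_frac=0.90),  # full depth, duplicated
]

labels = {c.name: classify(c, diploid_depth=60.0) for c in contigs}
for name, label in labels.items():
    print(name, label)
```

In this toy setup only `ctg_b` is flagged, because neither signal alone is treated as sufficient: half-depth sequence without a match may be genuinely hemizygous, and full-depth sequence with a match may be a true repeat.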
The authors recommend using *purge_dups* directly after initial assembly, prior to scaffolding, to improve contiguity and avoid false joins caused by heterozygous overlaps. The tool is available in C and can be accessed at https://github.com/dfguan/purge_dups.