Identifying and removing haplotypic duplication in primary genome assemblies

2020 | Dengfeng Guan, Shane A. McCarthy, Jonathan Wood, Kerstin Howe, Yadong Wang, Richard Durbin
A novel tool, purge_dups, has been developed to identify and remove haplotigs and heterozygous overlaps in primary genome assemblies. It uses sequence similarity and read depth to detect and remove this duplication automatically, improving assembly continuity while maintaining completeness. Compared to existing tools such as purge_haplotigs and HaploMerger2, purge_dups is more effective at removing heterozygous duplication and is fully automatic, which makes it easy to integrate into assembly pipelines.

The tool first maps long-read sequencing data onto the assembly to determine read depth, then splits the draft assembly into contigs and generates self-alignments. Haplotigs are identified and removed, and overlaps are then detected by analyzing consistent matches among the remaining contigs. Overlaps with low coverage are marked as heterozygous and removed.

The authors tested purge_dups on four primary assemblies: Arabidopsis thaliana, Anopheles coluzzii, Vitis vinifera, and Myripristis murdjan. It significantly reduced duplicated haploid-unique k-mers and improved the completeness of gene sets as measured by BUSCO scores. For the Myripristis murdjan assembly, purge_dups produced a higher contig N50 and better scaffolding than the other tools. When assessed with QUAST, the purge_dups scaffolds had the highest NGA50, indicating more contiguous alignment to the reference genome.

The study highlights the importance of removing heterozygous overlaps in addition to haplotigs to improve genome assembly quality. The authors recommend running purge_dups after initial assembly and before scaffolding, since it does not require user-defined cutoffs and can be integrated into automated assembly pipelines. The tool is available at https://github.com/dfguan/purge_dups.
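
The idea of combining read depth with self-alignment can be illustrated with a small sketch. When both haplotypes of a heterozygous region are assembled separately, each copy tends to receive roughly half of the reads, so a contig that sits near half the diploid depth and is largely covered by alignments to another contig is a plausible haplotig. The Python below is a toy illustration under those assumptions, not the purge_dups implementation: the `Contig` fields, the `classify` function, and the `min_aligned` and `tolerance` thresholds are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    name: str
    mean_depth: float        # mean long-read depth over the contig
    aligned_fraction: float  # fraction covered by self-alignments to other contigs


def classify(contig: Contig, diploid_depth: float,
             min_aligned: float = 0.8, tolerance: float = 0.25) -> str:
    """Toy rule (an assumption, not the purge_dups algorithm): flag a contig
    as a haplotig candidate when its depth is close to half the diploid depth
    and most of it aligns to another contig."""
    haploid_depth = diploid_depth / 2
    near_haploid = abs(contig.mean_depth - haploid_depth) <= tolerance * haploid_depth
    if near_haploid and contig.aligned_fraction >= min_aligned:
        return "haplotig_candidate"
    return "keep"


# Made-up example values, for illustration only.
contigs = [
    Contig("ctg1", mean_depth=60.0, aligned_fraction=0.05),  # diploid depth, unique
    Contig("ctg2", mean_depth=31.0, aligned_fraction=0.95),  # half depth, duplicated
    Contig("ctg3", mean_depth=29.0, aligned_fraction=0.10),  # half depth, but unique
]
labels = {c.name: classify(c, diploid_depth=60.0) for c in contigs}
print(labels)
```

In this sketch only `ctg2` is flagged. `ctg3` has half depth but no matching sequence elsewhere, so it is kept, which shows why neither signal is used on its own.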