25 January 2024 | James M. Holt, Christopher T. Saunders, William J. Rowell, Zev Kronenberg, Aaron M. Wenger, Michael Eberle
HiPhase is a new tool for jointly phasing small, structural, and tandem repeat variants from HiFi sequencing data in diploid organisms. It addresses a limitation of existing phasing tools, which typically phase only small variants, and it is the first phasing tool to jointly phase SNVs, indels, structural, and tandem repeat variants. HiPhase uses two novel approaches: dual mode allele assignment and a phasing algorithm based on the A* search algorithm. Its benefits include no data down-sampling, support for multi-allelic variation, logic to span coverage gaps, innate multi-threading, built-in statistics gathering, haplotagging, and concurrent phased alignment output generation.
HiPhase breaks the phasing problem into three major components: phase block generation, allele assignment, and diplotype solving. Phase block generation identifies pairs of consecutive heterozygous variant calls connected by at least one read mapping. Allele assignment converts read mappings into condensed allelic observations, which are then used for diplotype solving. Diplotype solving uses the A* search algorithm to find the optimal diplotype.
HiPhase produced an average phase block NG50 of 480 kb with 929 switchflip errors and fully phased 93.8% of genes, improving over the current state of the art. Compared to WhatsHap, HiPhase generates longer phase blocks with fewer phasing errors, phases more total variants, and fully phases more genes. HiPhase is available as source code and a pre-compiled Linux binary at https://github.com/PacificBiosciences/HiPhase. Future versions will extend HiPhase to other forms of structural variants.
HiPhase is developed by PacBio and is available for use.
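The phase block generation step described above can be illustrated with a minimal sketch. This is not HiPhase's implementation (HiPhase also has logic to span coverage gaps and handles variants that are longer than a single position); it only shows the core rule that consecutive heterozygous variants stay in the same block when at least one read mapping connects them. The function name `phase_blocks`, the representation of variants as single positions, and the representation of reads as half-open `(start, end)` intervals are all assumptions made for illustration.

```python
from typing import List, Tuple


def phase_blocks(
    het_positions: List[int], reads: List[Tuple[int, int]]
) -> List[List[int]]:
    """Group heterozygous variant positions into phase blocks.

    Hypothetical simplification: each variant is a single position and each
    read mapping is a half-open interval [start, end). Two consecutive
    variants share a block when at least one read covers both of them.
    """
    positions = sorted(het_positions)
    if not positions:
        return []

    blocks = [[positions[0]]]
    for prev, curr in zip(positions, positions[1:]):
        # A read links the pair if it spans both variant positions.
        linked = any(start <= prev and curr < end for start, end in reads)
        if linked:
            blocks[-1].append(curr)
        else:
            # No read connects the pair, so a new phase block starts here.
            blocks.append([curr])
    return blocks


if __name__ == "__main__":
    variants = [100, 250, 900, 1500, 1600]
    mappings = [(50, 300), (200, 950), (1400, 1700)]
    print(phase_blocks(variants, mappings))
```

In this toy input no read spans both position 900 and position 1500, so the block is split there. The linear scan over all reads for every variant pair is chosen for readability; a real tool would use sorted or indexed alignments.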