July 24, 2018 | Ryan Poplin, Valentin Ruano-Rubio, Mark A. DePristo, Tim J. Fennell, Mauricio O. Carneiro, Geraldine A. Van der Auwera, David E. Kligler, Laura D. Gauthier, Ami Levy-Moonshine, David Roazen, Khalid Shakir, Joel Thibault, Sheila Chandran, Chris Whelan, Monkol Lek, Stacey Gabriel, Mark J Daly, Ben Neale, Daniel G. MacArthur, and Eric Banks
The paper presents a variant calling algorithm that combines the GATK HaplotypeCaller (HC) with a Reference Confidence Model (RCM), enabling accurate and efficient detection of genetic variants across tens of thousands of samples. The HC-RCM is a scalable, assembly-based algorithm that uses a pair-HMM to calculate genotype likelihoods. It determines those likelihoods independently for each sample, then performs joint calling across all samples in a project simultaneously. The result is a fully squared-off matrix of genotypes, covering every sample at every genomic position under investigation, which is crucial for accurate estimation of population allele frequencies. The paper also discusses why joint calling matters for variant discovery and why variant calling algorithms are hard to scale to large datasets; the HC-RCM addresses these challenges by reducing computational complexity while improving accuracy. Its scalability is demonstrated on more than 90,000 exomes from the Exome Aggregation Consortium (ExAC). Accuracy is validated against the Genome in a Bottle (GiaB) standard, a reference set of validated variants, where the HC-RCM shows higher sensitivity and specificity than other variant calling algorithms, particularly for indels. The authors see applications in both population genetics and clinical studies.
The study concludes that the HC-RCM is a powerful tool for variant discovery and has the potential to significantly improve the accuracy and efficiency of genetic variant detection in large-scale studies.
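
The split between per-sample genotype likelihoods and a joint calling step over the whole cohort can be illustrated with a toy example. The sketch below is not the GATK implementation: it assumes a single biallelic site, diploid samples, a Hardy-Weinberg genotype prior, and a simple EM estimate of the alternate allele frequency, none of which are specified in the summary above. The function names and the likelihood values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Genotype order used throughout: hom-ref, het, hom-alt.
GENOTYPES = ("0/0", "0/1", "1/1")


def hwe_prior(alt_freq: float) -> Tuple[float, float, float]:
    """Genotype prior under Hardy-Weinberg equilibrium (an assumption of this sketch)."""
    ref_freq = 1.0 - alt_freq
    return (ref_freq ** 2, 2.0 * ref_freq * alt_freq, alt_freq ** 2)


def genotype_posteriors(likelihoods: Sequence[float], alt_freq: float) -> List[float]:
    """Combine one sample's genotype likelihoods with the cohort-level prior."""
    weighted = [lik * prior for lik, prior in zip(likelihoods, hwe_prior(alt_freq))]
    total = sum(weighted)
    return [w / total for w in weighted]


def estimate_alt_frequency(cohort: Sequence[Sequence[float]],
                           iterations: int = 50,
                           start: float = 0.1) -> float:
    """EM estimate of the alternate allele frequency from per-sample likelihoods."""
    alt_freq = start
    for _ in range(iterations):
        expected_alt_alleles = 0.0
        for likelihoods in cohort:
            post = genotype_posteriors(likelihoods, alt_freq)
            expected_alt_alleles += post[1] + 2.0 * post[2]
        alt_freq = expected_alt_alleles / (2.0 * len(cohort))
        # Keep the estimate strictly inside (0, 1) so the prior never collapses to zero.
        alt_freq = min(max(alt_freq, 1e-6), 1.0 - 1e-6)
    return alt_freq


def joint_call(cohort: Sequence[Sequence[float]]) -> Tuple[float, List[str]]:
    """Return the cohort allele frequency and one genotype per sample (a squared-off row)."""
    alt_freq = estimate_alt_frequency(cohort)
    calls = []
    for likelihoods in cohort:
        post = genotype_posteriors(likelihoods, alt_freq)
        best = max(range(len(GENOTYPES)), key=lambda g: post[g])
        calls.append(GENOTYPES[best])
    return alt_freq, calls


if __name__ == "__main__":
    # Hypothetical likelihoods P(reads | genotype), computed independently per sample.
    # The first two samples show no variant evidence, so their entries play the role
    # of reference-confidence likelihoods rather than being left missing.
    cohort = [
        (1.0, 1e-3, 1e-6),
        (1.0, 1e-3, 1e-6),
        (1e-4, 1.0, 1e-3),
        (1e-6, 1e-3, 1.0),
    ]
    alt_freq, calls = joint_call(cohort)
    print(f"estimated alt allele frequency: {alt_freq:.3f}")
    print(calls)
```

Because every sample contributes likelihoods at the site, including those that look homozygous reference, each sample receives a genotype and the allele frequency is estimated from the full cohort rather than only from the samples in which a variant was seen.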