July 24, 2018 | Ryan Poplin, Valentin Ruano-Rubio, Mark A. DePristo, Tim J. Fennell, Mauricio O. Carneiro, Geraldine A. Van der Auwera, David E. Kling, Laura D. Gauthier, Ami Levy-Moonshine, David Roazen, Khalid Shakir, Joel Thibault, Sheila Chandran, Chris Whelan, Monkol Lek, Stacey Gabriel, Mark J Daly, Ben Neale, Daniel G. MacArthur, Eric Banks
The paper introduces a novel assembly-based approach to variant calling, the GATK HaplotypeCaller (HC) and Reference Confidence Model (RCM), which efficiently and accurately detects all classes of genetic variation across tens to hundreds of thousands of human samples. The HC-RCM algorithm determines genotype likelihoods independently for each sample, yet still performs joint calling across all samples within a project. The authors demonstrate that the HC-RCM scales efficiently to very large sample sizes without loss of accuracy and outperforms other algorithms in indel variant calling. The HC-RCM produces a fully squared-off matrix of genotypes across all samples at every genomic position, making it suitable for population genetics and clinical studies. This scalability is crucial for joint calling in large cohorts: the HC-RCM's runtime increases linearly with the number of samples, whereas the runtime of other algorithms increases superlinearly. The HC-RCM has been successfully applied to produce a joint call set of over 90,000 exome samples for the Exome Aggregation Consortium (ExAC).
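
To make the idea of a "squared-off" genotype matrix concrete, here is a minimal toy sketch in Python. It is not GATK code and does not reflect GATK's file formats or statistical model: the `SampleRecord` structure, the `genotype_at` and `square_off` helpers, and the example confidence values are all hypothetical, chosen only for illustration. Each sample is processed on its own into variant calls plus reference-confidence blocks, and a later merge step fills in every sample at every reported site. For brevity the sketch reports only sites where at least one sample carries a variant.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class SampleRecord:
    """Hypothetical per-sample output, produced independently for each sample."""
    variants: Dict[int, str]                # position -> genotype, e.g. "0/1"
    ref_blocks: List[Tuple[int, int, int]]  # (start, end, confidence), inclusive


def genotype_at(rec: SampleRecord, pos: int) -> str:
    """Return a genotype string for one sample at one position."""
    if pos in rec.variants:
        return rec.variants[pos]
    # No variant call: fall back to the reference-confidence block, if any.
    for start, end, conf in rec.ref_blocks:
        if start <= pos <= end:
            return f"0/0:{conf}"
    # No data at all for this position.
    return "./."


def square_off(samples: Dict[str, SampleRecord]) -> Dict[int, Dict[str, str]]:
    """Build a site-by-sample matrix with an entry for every sample at every site."""
    sites = sorted({p for rec in samples.values() for p in rec.variants})
    return {
        pos: {name: genotype_at(rec, pos) for name, rec in samples.items()}
        for pos in sites
    }


samples = {
    "A": SampleRecord({100: "0/1"}, [(1, 99, 60), (101, 200, 55)]),
    "B": SampleRecord({150: "1/1"}, [(1, 149, 40)]),
}
matrix = square_off(samples)
for pos, row in matrix.items():
    print(pos, row)
```

In this toy model the per-sample step never looks at other samples, so adding a sample adds one more independent record plus one more column in the merge. That is only meant to give an intuition for why such a design can scale roughly linearly with cohort size; the paper's actual runtime results come from its own benchmarks.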