March 20, 2018 | Ryan Poplin¹², Pi-Chuan Chang², David Alexander², Scott Schwartz², Thomas Colthurst², Alexander Ku², Dan Newburger¹, Jojo Dijamco¹, Nam Nguyen¹, Pegah T. Afshar¹, Sam S. Gross¹, Lizzie Dorfman¹², Cory Y. McLean¹², Mark A. DePristo*¹²
DeepVariant is a deep learning-based variant caller that uses neural networks to identify genetic variants in next-generation sequencing (NGS) data. It outperforms existing tools and won the "highest performance" award for SNPs in an FDA-administered variant calling challenge. The system uses a deep convolutional neural network (CNN) to learn statistical relationships between read pileups and ground-truth genotype calls, enabling it to generalize across genome builds and even to other mammalian species. Unlike traditional tools that rely on hand-crafted statistical models, DeepVariant uses a single deep learning model to call variants across various sequencing technologies and experimental designs, including whole-genome sequencing from 10X Genomics and Ion Ampliseq exomes.
DeepVariant's approach is more accurate and more consistent across different quality metrics than existing tools. It performs well on Illumina sequencing data and has been shown to generalize to other sequencing technologies, including SOLiD and PacBio, whose candidate callsets have high error rates. DeepVariant also performs well on exome datasets, even those with low initial positive predictive values (PPVs), and shows significant improvements in PPV after retraining. The system is robust to changes in sequencing depth, preparation protocol, instrument type, genome build, and even mammalian species, making it useful for resequencing projects in non-human species, which often lack ground-truth data.
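As a reminder of what PPV measures in this context, the short sketch below computes it from counts of true-positive and false-positive variant calls. The function name and the counts are hypothetical and chosen only for illustration; they are not results from the study.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Return PPV (precision): the fraction of emitted calls that are correct."""
    total_calls = true_positives + false_positives
    if total_calls == 0:
        raise ValueError("PPV is undefined when no variants were called")
    return true_positives / total_calls


# Hypothetical counts for one callset before and after retraining
# (illustrative numbers only, not taken from the paper).
ppv_before = positive_predictive_value(true_positives=900, false_positives=600)
ppv_after = positive_predictive_value(true_positives=900, false_positives=100)
print(f"PPV before: {ppv_before:.2f}, after: {ppv_after:.2f}")
```

A callset with many false positives has a low PPV even when most true variants are found, which is why PPV is the natural metric for noisy candidate callsets.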
DeepVariant was tested on a variety of datasets, including the Genome in a Bottle benchmark, and outperformed other bioinformatics methods in accuracy. It reduced the total number of errors per genome by more than 50% relative to the next-best algorithm. DeepVariant's performance was also validated in a blinded sample submission to the Food and Drug Administration-sponsored variant calling Truth Challenge, where it won the "highest performance" award for SNPs.
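The error-reduction figure can be read as a relative reduction in total errors per genome. The sketch below shows that arithmetic, assuming total errors are counted as false positives plus false negatives; that counting convention, the function names, and the counts are assumptions for illustration, not values reported in the study.

```python
def total_errors(false_positives: int, false_negatives: int) -> int:
    """Total errors for a callset, assumed here to be FP + FN."""
    return false_positives + false_negatives


def relative_error_reduction(baseline_errors: int, new_errors: int) -> float:
    """Fraction by which new_errors is lower than baseline_errors."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 1.0 - new_errors / baseline_errors


# Hypothetical per-genome error counts (illustrative only).
baseline = total_errors(false_positives=6000, false_negatives=4000)
candidate = total_errors(false_positives=2500, false_negatives=1500)
reduction = relative_error_reduction(baseline, candidate)
print(f"Relative error reduction: {reduction:.0%}")
```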
The system's ability to learn from a wide range of sequencing technologies and experimental designs makes it a significant step towards more automated deep learning approaches for variant calling. DeepVariant represents a shift from expert-driven statistical modeling to machine learning-based methods, enabling more accurate and consistent variant calling across diverse sequencing platforms.