This paper introduces a Bayesian statistical framework for detecting haplotypes from short-read sequencing data, which is implemented in the tool FreeBayes. The framework allows for the modeling of multiallelic loci and non-uniform copy numbers across individuals, improving the accuracy of variant detection. Traditional methods for variant detection often assume biallelic loci and uniform copy numbers, which can limit their ability to detect small variants on sex chromosomes, in polyploid organisms, or in regions with copy-number variations. The new framework addresses these limitations by incorporating Bayesian inference to estimate the probability of genotypes given sequencing data and prior allele frequencies.
The framework applies Bayes' rule to compute the probability of each genotype given the sequencing observations, accounting for both sequencing errors and mapping quality. Priors on allele frequencies come from Ewens' sampling formula, which gives the probability of observing a particular configuration of allele counts in a sample drawn from a population. Using this prior yields more accurate allele frequency estimates and improves the detection of rare variants.
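The two ingredients named above can be illustrated with a minimal Python sketch. It is not FreeBayes' actual likelihood model: the independent-error model, the way base and mapping quality are combined, and all function names (`phred_to_error`, `genotype_likelihood`, `genotype_posteriors`, `ewens_probability`) are assumptions made for illustration. The Ewens function follows the standard textbook form of the sampling formula with mutation parameter `theta`.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import factorial, prod


def phred_to_error(q):
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def genotype_likelihood(genotype, observations, n_alleles):
    """P(observations | genotype) under a simple independent-error model.

    genotype: tuple of alleles whose length is the sample's ploidy.
    observations: list of (allele, base_quality, mapping_quality).
    n_alleles: number of candidate alleles at the locus (assumed >= 2).
    """
    ploidy = len(genotype)
    likelihood = 1.0
    for allele, base_q, map_q in observations:
        # Assumption: the observation is wrong if either the base call
        # or the read placement is wrong.
        err = 1 - (1 - phred_to_error(base_q)) * (1 - phred_to_error(map_q))
        per_read = 0.0
        for copy in genotype:
            # Each chromosome copy is sampled with equal probability.
            match = (1 - err) if copy == allele else err / (n_alleles - 1)
            per_read += match / ploidy
        likelihood *= per_read
    return likelihood


def genotype_posteriors(observations, alleles, ploidy, prior=None):
    """Normalized posterior over all genotypes of the given ploidy.

    prior: optional dict genotype -> prior probability (flat if omitted).
    """
    weights = {}
    for g in combinations_with_replacement(alleles, ploidy):
        p = prior[g] if prior else 1.0
        weights[g] = p * genotype_likelihood(g, observations, len(alleles))
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def ewens_probability(allele_counts, theta):
    """Ewens' sampling formula for a list of per-allele copy counts."""
    n = sum(allele_counts)
    # a[j] = number of distinct alleles observed exactly j times.
    a = Counter(c for c in allele_counts if c > 0)
    rising = prod(theta + i for i in range(n))
    partition = prod(theta ** a_j / (j ** a_j * factorial(a_j))
                     for j, a_j in a.items())
    return factorial(n) / rising * partition
```

For a diploid sample with five reads supporting `A` and five supporting `C`, `genotype_posteriors(obs, "AC", 2)` favors the heterozygote `("A", "C")`. Passing `ploidy=1` or `ploidy=4` enumerates haploid or tetraploid genotypes with the same code, which is the kind of non-uniform copy number the framework is meant to handle.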
The paper describes the implementation of this framework in FreeBayes, which directly detects haplotypes from short-read sequencing data. The method uses dynamic windowing to identify regions of the genome with multiple segregating alleles and assembles haplotype observations from aligned reads. These observations are then used to estimate the probability of polymorphism at a locus and to determine the most likely genotypes for each sample.
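As a rough illustration of windowing and haplotype observation, the sketch below merges overlapping candidate-variant intervals into windows and then counts the sequences that reads carry across a window. It is a simplified assumption, not FreeBayes' implementation: reads are treated as gapless alignments given as `(reference_start, sequence)`, and the names `merge_windows` and `haplotype_observations` are hypothetical.

```python
from collections import Counter


def merge_windows(intervals):
    """Merge overlapping half-open intervals [start, end) into windows."""
    windows = []
    for start, end in sorted(intervals):
        if windows and start < windows[-1][1]:
            # Overlaps the previous window: extend it.
            windows[-1][1] = max(windows[-1][1], end)
        else:
            windows.append([start, end])
    return [tuple(w) for w in windows]


def haplotype_observations(reads, window):
    """Count the sequences observed by reads that fully span the window.

    reads: list of (reference_start, sequence), assumed gapless.
    window: half-open reference interval (start, end).
    """
    w_start, w_end = window
    counts = Counter()
    for r_start, seq in reads:
        r_end = r_start + len(seq)
        # Only reads covering the whole window give a haplotype observation.
        if r_start <= w_start and r_end >= w_end:
            counts[seq[w_start - r_start:w_end - r_start]] += 1
    return counts
```

Each distinct sequence in the returned counter can then be treated as one allele at the locus, so a window containing several nearby variants is genotyped as a single multiallelic site rather than as independent positions.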
The framework also incorporates genotype imputation, which improves the accuracy of genotyping by considering the likelihood of genotypes across multiple samples. This allows for the determination of marginal genotype likelihoods and provides a more accurate estimate of the quality of genotypes. The method is efficient and can be applied to large datasets, making it suitable for population-level inference. The paper concludes that the Bayesian framework provides a robust and accurate method for detecting haplotypes and variants from short-read sequencing data.
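The idea of marginal genotype probabilities across samples can be sketched as follows. This is a brute-force illustration under stated assumptions, not the paper's algorithm: it enumerates every combination of genotypes across samples, which is only feasible for a handful of samples, and the function name and the `joint_prior` callable are hypothetical.

```python
from itertools import product


def marginal_genotype_posteriors(sample_likelihoods,
                                 joint_prior=lambda combo: 1.0):
    """Marginal posterior of each sample's genotype given all samples.

    sample_likelihoods: one dict per sample, genotype -> P(reads | genotype).
    joint_prior: callable mapping a tuple of genotypes (one per sample)
        to a prior weight; flat by default.
    """
    marginals = [dict.fromkeys(lk, 0.0) for lk in sample_likelihoods]
    total = 0.0
    # Iterating over a dict yields its keys, so each combo is one
    # genotype per sample.
    for combo in product(*sample_likelihoods):
        weight = joint_prior(combo)
        for genotype, lk in zip(combo, sample_likelihoods):
            weight *= lk[genotype]
        total += weight
        for i, genotype in enumerate(combo):
            marginals[i][genotype] += weight
    return [{g: w / total for g, w in m.items()} for m in marginals]
```

With a flat prior the marginals reduce to each sample's own normalized likelihoods. A prior that depends on the joint allele counts lets evidence from other samples shift an ambiguous call, and a Phred-scaled genotype quality can be derived from a marginal `p` as `-10 * log10(1 - p)`.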