[slides] fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets

fastSTRUCTURE is a variational Bayesian inference method for estimating population structure in large SNP data sets. It builds on the model underlying the STRUCTURE program, replacing its MCMC sampling with a variational framework so that inference is faster while accuracy remains comparable to ADMIXTURE. The method includes heuristic scores for determining the number of populations and a new hierarchical prior for detecting weak structure. Tested on simulated data and on the CEPH–Human Genome Diversity Panel (HGDP) data, it is about 100 times faster than STRUCTURE and achieves accuracy similar to ADMIXTURE. The algorithm is freely available online.

The paper discusses the challenges of inferring population structure in large genetic data sets and the limitations of existing methods. It introduces a variational Bayesian approach that recasts approximate posterior inference as an optimization problem, which allows faster computation and better scalability. The method uses a flexible prior distribution over hidden parameters and a heuristic score for choosing model complexity, which helps identify the number of populations in the data.

The paper also describes the generative model for population structure, including its assumptions about allele frequencies and admixture proportions. It outlines the variational inference framework, which approximates the true posterior distribution with a member of a tractable family of distributions. The approximation is chosen to minimize the Kullback-Leibler divergence between the variational distribution and the true posterior, which yields an optimization problem that can be solved efficiently.

The paper evaluates fastSTRUCTURE on simulated data sets with varying population structures and numbers of populations. It compares accuracy and runtime against STRUCTURE and ADMIXTURE, showing that fastSTRUCTURE achieves comparable accuracy with a significantly shorter runtime.
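The optimization view of variational inference summarized above rests on a standard identity, sketched here in generic notation. The symbols are assumptions for illustration and are not taken from the paper: $x$ denotes the observed genotypes, $\theta$ collects all hidden variables and parameters (population assignments, admixture proportions, allele frequencies), and $q$ is a distribution from the tractable family.

```latex
\begin{align}
\log p(x) &= \mathcal{L}(q) + \mathrm{KL}\!\left(q(\theta)\,\|\,p(\theta \mid x)\right), \\
\mathcal{L}(q) &= \mathbb{E}_{q}\!\left[\log p(x, \theta)\right] - \mathbb{E}_{q}\!\left[\log q(\theta)\right].
\end{align}
```

Because $\log p(x)$ does not depend on $q$ and the KL term is non-negative, maximizing the lower bound $\mathcal{L}(q)$ over the tractable family is equivalent to minimizing the KL divergence to the true posterior.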
The method is also tested on the HGDP data set, where it identifies population structure and provides reasonable estimates of ancestry proportions.

The paper discusses the choice of model complexity K and proposes two metrics for selecting it: the value of K that maximizes the log-marginal likelihood lower bound (LLBO), and the minimum number of populations needed to explain the data. It also addresses overfitting by using a logistic prior and multiple random restarts to estimate variational parameters.

The results show that fastSTRUCTURE is effective in identifying population structure and provides robust estimates of ancestry proportions, even when structure is weak. The method is computationally efficient and suitable for large data sets with hundreds of thousands of genetic variants.
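The two ways of choosing K mentioned above can be illustrated with a minimal Python sketch. This is not the fastSTRUCTURE implementation: the function names, the input layout (a dict of lower-bound values per K and a per-individual matrix of admixture proportions), and the 0.9999 cutoff are all assumptions made for illustration.

```python
def best_k_by_llbo(llbo_by_k):
    """Return the K whose fit reached the largest lower bound (LLBO).

    llbo_by_k: hypothetical dict mapping each tried K to its LLBO value.
    """
    return max(llbo_by_k, key=llbo_by_k.get)


def components_needed(admixture, threshold=0.9999):
    """Count the fewest components whose combined mean ancestry share
    reaches `threshold` (the cutoff is an assumed value).

    admixture: list of rows, one per individual; each row holds that
    individual's admixture proportions across the K fitted components.
    """
    n = len(admixture)
    k = len(admixture[0])
    # Average contribution of each component across individuals.
    mean_share = [sum(row[j] for row in admixture) / n for j in range(k)]
    total = sum(mean_share)
    running = 0.0
    # Add components from largest to smallest until the cutoff is met.
    for count, share in enumerate(sorted(mean_share, reverse=True), start=1):
        running += share / total
        if running >= threshold:
            return count
    return k


if __name__ == "__main__":
    # Made-up lower-bound values for three candidate K.
    print(best_k_by_llbo({1: -0.80, 2: -0.71, 3: -0.72}))
    # A K=3 fit in which the third component carries no ancestry.
    print(components_needed([[0.7, 0.3, 0.0], [0.2, 0.8, 0.0]]))
```

In this toy input both criteria point to K = 2; on real data they can disagree, which is one reason to look at both.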

fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets

June 2014 | Anil Raj, Matthew Stephens, and Jonathan K. Pritchard