Polygenic risk scores (PRS) summarize genetic effects across many markers that individually do not reach significance in large-scale association studies. These scores are constructed using a training sample to select markers and then applied to an independent replication sample to predict trait values or assess genetic associations. PRS have been used to detect genetic signals when no single marker is significant, to identify common genetic bases for related disorders, and to build risk prediction models. However, their predictive accuracy and statistical power depend on factors such as sample size, the proportion of genetic variance explained, and the method of weighting marker effects.
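The construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the function names (`marginal_effects`, `polygenic_score`), the simulated genotypes, the effect-size distribution, and the P-value threshold of 0.05 are all hypothetical choices made for the example.

```python
import math
import numpy as np


def marginal_effects(genotypes, trait):
    """Estimate one-marker-at-a-time effects and two-sided P-values.

    Assumes a quantitative trait and small per-marker effects, so that
    sqrt(n) * beta is approximately standard normal under the null.
    """
    n = genotypes.shape[0]
    x = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)
    y = (trait - trait.mean()) / trait.std()
    beta = x.T @ y / n                       # slope of trait on each standardized marker
    z = beta * math.sqrt(n)                  # approximate z-statistic
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])
    return beta, pvals


def polygenic_score(genotypes, beta, pvals, p_threshold=0.05, weighted=True):
    """Score individuals using markers selected in the training sample."""
    selected = pvals < p_threshold
    # Weight by estimated effect, or by its sign only (an unweighted allele count).
    weights = beta[selected] if weighted else np.sign(beta[selected])
    x = genotypes[:, selected]
    x = (x - x.mean(axis=0)) / x.std(axis=0)
    return x @ weights, selected


# Hypothetical simulated data: 200 independent markers with small additive effects.
rng = np.random.default_rng(0)
n_markers = 200
freq = rng.uniform(0.1, 0.5, size=n_markers)
true_beta = rng.normal(0.0, 0.05, size=n_markers)


def simulate(n):
    g = rng.binomial(2, freq, size=(n, n_markers)).astype(float)
    y = g @ true_beta + rng.normal(size=n)
    return g, y


g_train, y_train = simulate(2000)            # training sample: select and weight markers
g_rep, y_rep = simulate(1000)                # independent replication sample

beta_hat, pvals = marginal_effects(g_train, y_train)
score, selected = polygenic_score(g_rep, beta_hat, pvals)
corr = np.corrcoef(score, y_rep)[0, 1]
print(f"markers selected: {selected.sum()}, score-trait correlation: {corr:.3f}")
```

Passing `weighted=False` gives the unweighted variant, which is one of the weighting choices the accuracy of the score depends on.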
The power and predictive accuracy of PRS are derived from a quantitative genetics model, considering the sizes of the training and replication samples, the explained genetic variance, and the selection thresholds for including markers. Expressions are derived for both quantitative and discrete traits, with the latter allowing for case/control sampling. A novel approach to estimating the variance explained by a marker panel is also proposed. Published studies with significant PRS associations have been well powered, while negative results may be due to small sample sizes. Useful levels of prediction may be achieved only with very large samples, up to an order of magnitude larger than currently available. Thus, PRS currently have more utility for association testing than for predicting complex traits, but prediction will become more feasible as sample sizes grow.
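The contrast between association testing and prediction can be illustrated with a rough calculation. The sketch below does not reproduce the expressions derived in the work summarized here. It uses a simplified approximation that is common in quantitative genetics, R2 = h2 / (1 + m / (n * h2)), and it assumes a quantitative trait, m independent markers all included in the score without P-value selection, and a normal approximation for the test statistic. The function names and the numerical settings are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def expected_r2(h2, n_markers, n_train):
    """Approximate variance explained by a score whose weights are estimated.

    h2 is the variance explained by the true marker effects; each estimated
    effect carries sampling variance of about 1 / n_train.
    """
    return h2 / (1.0 + n_markers / (n_train * h2))


def association_power(r2, n_rep, alpha=0.05):
    """Approximate power of a two-sided test of score-trait association."""
    shift = math.sqrt(n_rep * r2 / (1.0 - r2))   # mean of the z-statistic
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical setting: 100,000 independent markers jointly explaining half
# of the trait variance, tested in a replication sample of 5,000.
h2, n_markers, n_rep = 0.5, 100_000, 5_000
for n_train in (1_000, 10_000, 100_000, 1_000_000):
    r2 = expected_r2(h2, n_markers, n_train)
    power = association_power(r2, n_rep)
    print(f"n_train={n_train:>9,}  expected R2={r2:.4f}  power={power:.3f}")
```

Under these assumptions the association test gains power at training sample sizes where the score still explains only a small fraction of the variance captured by the true effects, which is the pattern described above.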
The first successful application of PRS to GWAS data was in schizophrenia, where a large number of markers were associated with disease in a second sample, suggesting a polygenic component. Similar results have been observed for other complex traits, including multiple sclerosis, height, cardiovascular risk, rheumatoid arthritis, and body mass index. However, studies of breast and prostate cancers have been inconclusive, possibly due to technical issues or small sample sizes. The present work aims to determine whether the negative results in these studies could be explained by sample size, or whether a true lack of polygenic effect is the more likely explanation.
Polygenic scores must be estimated from a finite training sample, and their effectiveness for association testing and risk prediction depends on the precision of this estimation and the proportion of variation explained by the score. The role of sample size in this context has not been thoroughly considered. Several authors have expressed sensitivity and specificity in terms of genetic variance, but they did not distinguish the variance explained by an estimated predictor from that of the true predictor. Large samples lead to small sampling variance on individual marker effects, but errors accumulate across multiple markers, affecting the polygenic score.
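The gap between the true and the estimated predictor can be seen in a small simulation. This is a sketch under simplifying assumptions, not an analysis from the source: independent standard normal variables stand in for standardized genotypes, all markers are included in the score, and the function name `simulate_r2` and its default values are hypothetical.

```python
import numpy as np


def simulate_r2(n_train, n_markers=500, h2=0.5, n_rep=2000, seed=0):
    """Return (R2 of the true predictor, R2 of the estimated predictor)."""
    rng = np.random.default_rng(seed)
    # True effects scaled so that the markers jointly explain about h2.
    beta = rng.normal(0.0, np.sqrt(h2 / n_markers), size=n_markers)

    def sample(n):
        x = rng.standard_normal((n, n_markers))          # stand-in for genotypes
        y = x @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
        return x, y

    x_train, y_train = sample(n_train)
    x_rep, y_rep = sample(n_rep)

    # One-marker-at-a-time estimates; each has sampling variance near 1 / n_train.
    beta_hat = x_train.T @ y_train / n_train

    r2_true = np.corrcoef(x_rep @ beta, y_rep)[0, 1] ** 2
    r2_est = np.corrcoef(x_rep @ beta_hat, y_rep)[0, 1] ** 2
    return r2_true, r2_est


for n_train in (125, 500, 2000):
    r2_true, r2_est = simulate_r2(n_train)
    print(f"n_train={n_train:>5}  true R2={r2_true:.3f}  estimated R2={r2_est:.3f}")
```

Each estimated effect is only slightly noisy, but the noise from all 500 markers enters the score together, so the estimated predictor explains less variance than the true one until the training sample becomes large relative to the number of markers.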
Statistical properties of PRS analyses are derived from a quantitative genetics model as a function of explained genetic variance and sample sizes in discovery and replication samples. A range of options for constructing the score is considered, including estimation from a different trait, selection by P-values, and different weighting methods. The power is obtained for testing a PRS for association in a replication sample, and the correlation, mean square error, and AUC are obtained for a predictor estimated from a finite training sample. These results are used