Data quality control in genetic case-control association studies

This protocol outlines the critical steps for data quality control (QC) in genetic case-control association studies. It emphasizes identifying and removing DNA samples and markers that introduce bias, so that study results remain accurate. The protocol covers both genome-wide association (GWA) studies and candidate gene association studies, and details the use of tools such as PLINK for individual-level and marker-level QC. Key steps include:

1. **Per-Individual QC**:
   - Identifying individuals with discordant sex information.
   - Detecting individuals with elevated missing genotype or heterozygosity rates.
   - Identifying duplicated or related individuals.
   - Identifying individuals of divergent ancestry using principal component analysis (PCA).
2. **Per-Marker QC**:
   - Identifying SNPs with excessive missing genotype rates.
   - Testing SNPs for deviation from Hardy-Weinberg equilibrium (HWE).
   - Identifying SNPs with significantly different missing genotype rates between cases and controls.
   - Removing SNPs with very low minor allele frequency (MAF).

The protocol emphasizes manual inspection of genotype cluster plots to ensure that genotype calls are robust. It also highlights the need to address population stratification, which can introduce bias in GWA studies. The protocol is designed to be completed in approximately 8 hours and is illustrated using simulated datasets.
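
Several of the QC quantities listed above (missing genotype rate, heterozygosity rate, MAF, and deviation from HWE) can be illustrated with a small Python sketch. This is not the protocol's own implementation, which relies on PLINK. The genotype coding (0, 1 or 2 copies of one allele, `None` for a missing call), the function names, and the default thresholds are all assumptions chosen for illustration. A simple 1-degree-of-freedom chi-square test stands in for the HWE test here; dedicated tools such as PLINK typically use an exact test instead.

```python
import math

# Assumed coding: each genotype is the count of one allele (0, 1 or 2);
# None marks a missing call. Names and thresholds are hypothetical.

def missing_rate(genotypes):
    """Fraction of genotype calls that are missing (per individual or per SNP)."""
    return sum(g is None for g in genotypes) / len(genotypes)

def heterozygosity_rate(genotypes):
    """Heterozygous calls divided by non-missing calls for one individual."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(g == 1 for g in called) / len(called)

def minor_allele_frequency(genotypes):
    """Frequency of the rarer allele among non-missing calls for one SNP."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)

def hwe_chi_square_p(genotypes):
    """P-value of a 1-df chi-square goodness-of-fit test for HWE at one SNP."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        return float("nan")
    p = sum(called) / (2 * n)
    if p in (0.0, 1.0):
        return 1.0  # monomorphic SNP: no deviation can be measured
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [called.count(k) for k in (0, 1, 2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def failing_markers(matrix, max_missing=0.05, min_maf=0.01, min_hwe_p=1e-5):
    """Column indices of SNPs failing any per-marker filter.

    `matrix` holds one row per individual and one column per SNP.
    The default thresholds are placeholders, not recommendations.
    """
    failed = []
    for j, column in enumerate(zip(*matrix)):
        maf = minor_allele_frequency(column)
        if (missing_rate(column) > max_missing
                or math.isnan(maf) or maf < min_maf
                or hwe_chi_square_p(column) < min_hwe_p):
            failed.append(j)
    return failed
```

For example, `failing_markers(genotype_matrix)` returns the indices of SNPs to exclude, while `missing_rate` and `heterozygosity_rate` applied to each row give the per-individual statistics that would be inspected for outliers. In a real HWE filter the test is often applied to controls only; that refinement is left out of this sketch.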

September 2010 | Carl A. Anderson, Fredrik H. Pettersson, Geraldine M. Clarke, Lon R. Cardon, Andrew P. Morris, and Krina T. Zondervan