March 24, 2010 | Joshua C. Denny, Marylyn D. Ritchie, Melissa A. Basford, Jill M. Pulley, Lisa Bastarache, Kristin Brown-Gentry, Deede Wang, Dan R. Masys, Dan M. Roden and Dana C. Crawford
The study presents a novel method for phenome-wide association scans (PheWAS) that uses International Classification of Diseases, Ninth Revision (ICD9) billing codes to identify gene-disease associations. The researchers developed a code translation table that defines 776 disease populations and their controls from ICD9 codes in electronic medical records (EMRs). They genotyped 6005 European-American individuals in BioVU, a DNA biobank, at five single nucleotide polymorphisms (SNPs) associated with various diseases. For each SNP, the PheWAS software generated case and control populations and analyzed disease-SNP associations. Four of seven known SNP-disease associations were replicated, with P-values between 2.8×10⁻⁶ and 0.011. The algorithm also identified 19 previously unknown associations at P < 0.01. The study demonstrates that PheWAS is a feasible method for investigating SNP-disease associations. However, the new associations still need to be validated, and appropriate statistical thresholds for clinical significance remain to be determined.
The study used ICD9 billing codes to approximate the clinical disease phenome. The researchers developed custom groupings of ICD9 codes better suited to large-scale genomic research. The algorithm was implemented as a Perl program that takes ICD9 codes, race/ethnicity, and genotypes as inputs. It translates the ICD9 codes into diagnostic groups and identifies the corresponding controls for each group. The output consists of Excel spreadsheets and tab-delimited files that summarize case and control counts, chi-square test statistics, P-values, and odds ratios. The program is freely available for download.
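The pipeline described above (translate codes into groups, split cases from controls, test each group) can be sketched in a few lines. This is a minimal illustration in Python, not the authors' Perl program. The two-row `CODE_TO_GROUP` table, the patient record layout, the toy data, and all function names are assumptions made for the example. It also assumes an allelic 2x2 chi-square test without continuity correction, and it treats every patient outside a group as a control, which is simpler than the control definitions the study describes.

```python
import math

# Hypothetical two-row excerpt of a code translation table (ICD9 code -> group).
# The table described in the study defines 776 groups; these rows are illustrative.
CODE_TO_GROUP = {
    "427.31": "atrial fibrillation",
    "714.0": "rheumatoid arthritis",
}


def phenotype_groups(icd9_codes):
    """Translate one patient's ICD9 codes into a set of phenotype groups."""
    return {CODE_TO_GROUP[code] for code in icd9_codes if code in CODE_TO_GROUP}


def allele_counts(minor_allele_counts):
    """Return (minor, major) allele totals from per-person counts of 0, 1, or 2."""
    minor = sum(minor_allele_counts)
    return minor, 2 * len(minor_allele_counts) - minor


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df), P-value, and odds ratio for [[a, b], [c, d]].

    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c) if b * c else math.inf
    return chi2, p_value, odds_ratio


def scan_group(patients, group):
    """Test one SNP against one phenotype group.

    Simplifying assumption: any patient without the group is a control.
    """
    cases, controls = [], []
    for patient in patients:
        in_group = group in phenotype_groups(patient["icd9"])
        (cases if in_group else controls).append(patient["minor_alleles"])
    a, b = allele_counts(cases)
    c, d = allele_counts(controls)
    return chi_square_2x2(a, b, c, d)


# Made-up records for illustration only; not data from the study.
TOY_PATIENTS = (
    [{"icd9": ["427.31"], "minor_alleles": g} for g in (2, 1, 1, 0)]
    + [{"icd9": ["714.0"], "minor_alleles": g} for g in (0, 0, 1)]
    + [{"icd9": [], "minor_alleles": g} for g in (0, 0, 1)]
)

chi2, p_value, odds_ratio = scan_group(TOY_PATIENTS, "atrial fibrillation")
print(f"chi2={chi2:.3f} P={p_value:.3f} OR={odds_ratio:.2f}")
```

A full scan would repeat `scan_group` for every phenotype group and every SNP, which is where the multiple-testing burden comes from.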
Four of the seven previously known SNP-disease associations were replicated; three were not. On reviewing the cases, the researchers found that some of the failures to replicate may have been due to false-positive case assignments. When the analysis was repeated with only true-positive cases, atrial fibrillation (AF) and systemic lupus erythematosus (SLE) replicated the previously reported associations, while carotid artery stenosis (CAS) showed a trend. The study highlights the statistical challenges of analyzing high-dimensional data and the limitations of ICD9 codes, which may lack sensitivity and specificity. The researchers suggest that combining billing codes, laboratory data, and natural language processing can improve phenotype identification.
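To make the high-dimensional testing problem concrete, the sketch below works through the arithmetic for the numbers reported in the summary: 776 phenotype groups and five SNPs. It is an illustration only. The Bonferroni correction is one conservative choice and is not presented here as the threshold the authors adopted. The calculation also assumes that every group is tested for every SNP and that the tests are independent, which real phenotypes are not. The function names are hypothetical.

```python
N_PHENOTYPES = 776  # disease groups defined by the translation table
N_SNPS = 5          # SNPs genotyped in the study


def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def expected_null_hits(p_cutoff, n_tests):
    """Expected number of tests passing p_cutoff if no true association exists."""
    return p_cutoff * n_tests


per_snp = bonferroni_threshold(0.05, N_PHENOTYPES)
overall = bonferroni_threshold(0.05, N_PHENOTYPES * N_SNPS)
print(f"Bonferroni cutoff, one SNP:   {per_snp:.2e}")
print(f"Bonferroni cutoff, five SNPs: {overall:.2e}")
print(f"Expected chance hits at P < 0.01, one SNP:   "
      f"{expected_null_hits(0.01, N_PHENOTYPES):.2f}")
print(f"Expected chance hits at P < 0.01, five SNPs: "
      f"{expected_null_hits(0.01, N_PHENOTYPES * N_SNPS):.1f}")
```

Under these assumptions the per-SNP cutoff is roughly 6.4×10⁻⁵, and about 39 tests across the five SNPs would be expected to reach P < 0.01 by chance alone, which is why associations reported at that level need independent validation.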
The study demonstrates the feasibility of PheWAS for discovering genetic associations with a single locus at a time. Future research should investigate more accurate methods of automatic phenotypic determination and extend to include other phenotypic traits. The study also suggests that coupling PheWAS with GWAS analysis could lead to increasing statistical challenges. The researchers emphasize the importance of further study to determine the true significance level of clinical and genetic interest. The study highlights the potential of EMR-linked biobanks for future research and the need for community researchers to refine the schema into a more robust, etiologic lexicon of disease phenotypes.