Introduction

The phenome-wide association study (PheWAS) design is emerging as a complementary/alternative approach to the genome-wide association study (GWAS).1 This is driven in part by challenges in interpreting GWAS results. GWASs often fail to identify clinically significant associations and it can be equally challenging to characterize biological function when a significant fraction of single-nucleotide polymorphisms (SNPs) identified by GWASs are tag SNPs in intergenic regions with no known function.2 The PheWAS methodology introduces a paradigm shift by considering the phenome to be as useful as the genome when discovering gene–disease associations. Whereas GWAS associates a phenotype with genotypes across the genome, PheWAS associates a genotype with phenotypes across the phenome. For the majority of PheWASs, phenomes have been defined by structured data in an electronic medical record (EMR) system; specifically by International Classification of Disease version 9 (ICD9) codes. In the United States, ICD9 codes are used as diagnostic codes for billing purposes. ICD9 codes are limited by variable positive and negative predictive values for describing disease phenotypes, but offer an extremely efficient mechanism to broadly capture patient histories for thousands of diseases at varying levels of phenotypic resolution. Nearly 17 000 possible phenotypes can be differentiated by the ICD9 coding system,3 and SNPs can be associated with each ICD9 code across the phenome.

The first PheWAS was published by Denny et al4 in 2010 and focused on five disease-associated SNPs identified by GWASs. In this proof-of-principle study, each SNP was associated with hundreds of phenotypes (ICD9 codes) across the phenome and was capable of identifying expected associations when using a genotype-to-phenotype approach. Since this proof-of-principle study, other groups have assessed previously reported GWAS SNPs.5, 6, 7, 8 The use of SNPs identified by GWAS for PheWAS has the advantage of leveraging known association data when interpreting PheWAS results. For example, PheWAS results demonstrate that multiple sclerosis (MS) and erythematous conditions, including rosacea, may share a common genetic etiology (ie, HLA-DRB1*1501),4, 6 demonstrating the capacity of PheWAS to identify pleiotropic effects for GWAS SNPs.

A challenge with using GWAS SNPs for PheWAS is the lack of biological/functional data associated with those SNPs. An alternative PheWAS approach may be to focus on variants with expected function, such as stop-gain and stop-loss variants. A stop-gain variant, or nonsense variant, introduces a premature stop codon in the mRNA coding sequence, whereas a stop-loss variant disrupts a current stop codon. Stop-gain variants may deactivate a protein by altering protein stability, result in the deletion of important protein domains, and/or cause reduced mRNA expression because of nonsense-mediated mRNA decay mechanisms.9 In genetic association studies, nonsense SNPs have a higher probability of being associated with a phenotype with often higher effect sizes, compared with other classes of variation (eg, missense SNPs).10 Furthermore, nonsense mutations represent a class of variation that explains a significant proportion of Mendelian diseases,11 and nonsense variants of unknown clinical significance may have similar evolutionary selective pressures as known disease-causing mutations.12 The biological significance of stop-loss variants is less well characterized. These variants may cause loss of function by disrupting protein stability or could result in the gain of function by the addition of extra amino acids.

In this study, we test the hypothesis that presumed functional variants may be associated with human disease and that these associations can be identified by PheWASs. Focusing on functional variants in PheWAS, primarily loss-of-function variants, is analogous to the type of reverse genetics experiment normally reserved for an animal model system (eg, transgenic knockout mouse). Whereas the use of GWAS SNPs in PheWAS leverages known phenotypic data, concentrating on functional variants leverages known biological insights.

Materials and methods

Ethics statement

This study was approved by the Marshfield Clinic Institutional Review Board in Marshfield, Wisconsin, and written and informed consent was acquired for all participants.

Patient population

Genotyped samples have been described elsewhere13, 14 and have been applied to a previously reported PheWAS.6 Briefly, all individuals genotyped are self-identified white/non-Hispanic Marshfield Clinic patients recruited into the Personalized Medicine Research Project (PMRP). PMRP represents a homogenous population with 77% of participants claiming German ancestry.13 In this PheWAS, 4235 patients were included in a discovery set and 10 640 patients were included in a validation set. The 4235 patients included in the discovery set were all over age 50 years (mean 74 years), have on average over 30 years of data in the EMR, and were originally selected as a subpopulation of PMRP to examine genetic associations with high-density lipoprotein levels or cataract disease.15 Importantly, all 4235 samples have been genotyped with Illumina 660W SNP chip. The 10 640 patient validation set represents all additional PMRP patients over the age of 40 years. The validation subset is younger (mean 59 years of age), but with comparable years of EMR data.

Phenome definition

The phenome was defined by patient EMR data as described previously.6 Briefly, ICD9 codes were used to define cases and controls at varying levels of phenotypic resolution (eg, ICD9 695, 695.1, 695.11). Patients coded for any one specific code became a ‘case’ for that code, whereas those not coded for any one specific code became a ‘control.’ Based on prevalence of ICD9 coding in the population, ‘rule-of-two’ was used to define cases for common conditions (>300 cases). Rule-of-two requires a patient to be coded two or more times to be considered a case. Because of privacy concerns, phenotypes observed less than eight times in the cohort were excluded. A total of 4841 phenotypes/ICD9 codes defined the phenome.

Statistical analysis

A total of 105 stop-gain/loss SNPs and 5 PheWAS control SNPs were genotyped in the discovery set (see Supplementary Table 1 for a list of candidate SNPs and a list of changes observed for all rs-numbers described in the study). Candidate SNPs were selected based on potential clinical relevance or availability of pre-existing genotype data. For candidate SNPs where direct genotype data were unavailable, an appropriate tag SNP was used (r2>0.9). A total of 31 stop-gain/loss SNPs with the strongest biological and/or statistical PheWAS results were further genotyped in the validation set. Additional details on the SNP selection and genotyping methods are available in Supplementary Methods. Each SNP was associated across the phenome.

For common ICD9 codes, logistic regression analyses were used. Both adjusted and unadjusted models were investigated. Covariates included sex and years of EMR data. Age was not included as a covariate because of potential confounding effects that may be unique to this population (Supplementary Material and Supplementary Figures S1 and S2). For rare ICD9 codes where cell counts for a genotype fell below five in a contingency table, Fisher’s exact test was used, similar to methods described in previous PheWASs.4, 6 Q–Q plots were generated for every SNP (data not shown) and at the study-wide level (Supplementary Material and Supplementary Figure S3) to measure systematic confounding or bias in the SNP–phenotype associations. The PheWAS results from the discovery set and validation set were combined by meta-analysis. Meta-analysis was implemented using a random effect model and pooling of studies using the Mantel–Haenszel method. Heterogeneity was measured by Q and I2. All analyses were conducted using the R statistical software package (www.r-project.org, Vienna, Austria).

Results

Phenomes

The discovery set phenome included 4841 ICD9 codes documented for 4235 patients with extensive genetic data available through the PMRP. PMRP patients are primarily Caucasian adult patients receiving care in the Marshfield Clinic system,13 and the discovery set represents PMRP patients over the age of 50 years. A validation set was also drawn from the PMRP cohort and represented all additional PMRP patients over the age of 40 years. Despite being drawn from PMRP, the discovery and validation sets differed by case size and age. The mean and median case size for each phenotype in the discovery set was 217 and 59, respectively, compared with 312 and 100, respectively, in the larger validation set. The average current age in the discovery set was 74 years. In addition, the average current age in the validation set was 59 years. The average age difference between cases and unaffected controls for all ICD9 codes was larger for the older discovery set compared with the younger validation set (Supplementary Figure S1). The average age difference between cases (age at diagnosis) and controls (current age) for the discovery set and validation set was 11.5 and 8.7 years, respectively. Furthermore, this age difference resulted in 3.3 times as many cases per control in the discovery set compared with the validation set (Supplementary Figure S2).

Control SNPs

For all five PheWAS control SNPs, the ICD9 codes that defined the expected phenotypes were associated with their respective SNPs (Figure 1 and Table 1). Because the PheWAS for rs3135388 was reported previously (Figure 1a),6 focus will be placed on the remaining four PheWAS control SNPs. In almost all cases, the expected ICD9 codes were ranked at or near the top of their respective PheWAS (Table 1), including rs9501572. Rs9501572 tags for HLA-B27 and is known to be associated with ankylosing spondylitis.16 Surprisingly, even with only 10 cases coded for ‘other inflammatory spondylopathies’ (ICD9 720.8), this code was the top PheWAS result for rs9501572 (P=3.3 × 10−4; Figure 1b).

Figure 1
figure 1

Manhattan plots of unadjusted −log10 (P-values) for the 4841 ICD9 and V codes that define the phenome. Highlighted are association results for (a) multiple sclerosis (ICD9 340) for rs3135388, (b) other inflammatory spondylopathies (ICD9 720.8) for rs9501572, (c) pure hyperglyceridemia (ICD9 272.1) for rs328, (d) atrial fibrillation (ICD9 427.31 and 427.3) for rs2200733, and (e) age-related macular degeneration (AMD) (ICD9 362.50, 362.51, 362.52, and 362.5) for rs1061170.

Table 1 PheWAS results for the five positive control SNPs genotyped in the discovery set

PheWAS control SNP rs12678919 is in linkage disequilibrium (LD) with the nonsense SNP rs328 (r2=1). Rs328 induces a premature stop codon in the gene for lipoprotein lipase (LPL). LPL is involved in lipid metabolism and rs328 is associated with both triglyceride and high-density lipoprotein levels.2 The second most significant PheWAS association for rs12678919 was pure hyperglyceridemia (ICD9 272.1, P=1.3 × 10−4; Figure 1c). Interestingly, the ICD9 code for chronic inflammation of the gall bladder (‘chronic cholecystitis,’ ICD9 575.11) was the fourth most significant PheWAS result for this SNP (P=5.0 × 10−4). Elevated triglyceride levels are a risk factor for chronic cholecystitis and may lead to an increased risk for gallbladder cancer.17

PheWAS control SNP rs2200733 has been previously shown to be associated with atrial fibrillation by GWAS.18, 19 The third and fifth most significant phenotypes in this PheWAS were atrial fibrillation codes (ICD9 427.31, P=2.1 × 10−4; ICD9 427.3, P=2.8 × 10−4; Figure 1d). This SNP has also been reported to be associated with ischemic stroke,20 but the associations for the ICD9 codes describing cerebrovascular disease (ICD9 430–438) were not significant.

The strongest PheWAS associations identified were observed for the control SNP rs1061170, a nonsynomous SNP in the gene for complement factor H (CFH). Rs1061170 has been associated with age-related macular degeneration (AMD) and represents one of the most significant SNPs identified by GWAS.2 The four top PheWAS findings for this SNP included nonexudant macular degeneration (ICD9 362.51, P=3.5 × 10−10), exudative macular degeneration (ICD9 362.52, P=9.2 × 10−8), macular degeneration unspecified (ICD9 362.50, P=2.8 × 10−5), and the code that defines AMD more broadly (ICD9 362.5, P=4.0 × 10−7; Figure 1e). Furthermore, ICD9 codes describing other sight loss phenotypes were also associated (eg, visual loss not otherwise specified, ICD9 369.9, P=6.2 × 10−4). These results may indicate high correlations across some ICD9 codings.

Stop-gain/loss SNPs

Discovery set

A total of 105 stop-gain/loss variants were selected for examination by PheWAS. Of the 105 SNPs, 22 are in Online Mendelian Inheritance in Men (OMIM) genes (Supplementary Table S1). In the discovery set, 35 SNPs had at least one ICD9 code with an association of P<1 × 10−4. The strongest PheWAS signal for the stop-gain/loss SNPs was for rs3731608, an apparent nonsense SNP in GABA receptor modulator DBI, with ICD9 V58.6 that defines current long-term drug use (P=3.2 × 10−6; Table 2). Of the 35 SNPs with at least one significant association, 6 had top ICD9 codes with unlikely genetic origins. For example, rs5758511, a nonsense SNP in CENPM, was associated with acquired deformities of the toe (ICD9 735). These six SNPs were not considered for replication studies. Conversely, functional SNPs in OMIM genes with available biological/clinical background information were given extra scrutiny.

Table 2 Association results for top PheWAS signals (P<1.0 × 104) in the discovery set and validation set

SNP rs2736911 introduces a premature stop codon (R38X) in the ARMS2 gene. Multiple SNPs in multiple genes, all in LD with one another, including a missense SNP 3′ downstream of rs2736911 in ARMS2 (rs10490924), have been associated with AMD in multiple GWASs.2 Importantly, this region is believed to be relevant to the less common, ‘wet’ form of AMD.21 According to 1000Genomes and HapMap results, rs2736911 and rs10490924 are not in LD in Caucasian populations (r2=0.001–0.067). The importance of ARMS2 is still uncertain because of unknown function for the missense SNP, LD patterns spanning multiple genes, and little knowledge regarding the function of those genes in relation to AMD.22 Characterizing a presumed functional variant in ARMS2, independent of GWAS SNPs, may assist when understanding the importance of this gene for AMD. Based on PheWAS results, the nonsense SNP rs2736911 was weakly associated with the ICD9 code defining wet AMD (ICD9 362.52, P=0.030; Figure 2a); no other AMD-related ICD9 codes were associated.

Figure 2
figure 2

Manhattan plots of unadjusted −log10 (P-values) for the nonsense variant rs2736911 in ARMS2. Highlighted is the ICD9 code for wet age-related macular degeneration (AMD) in the (a) discovery set, (b) validation set, and (c) meta-analysis.

Validation set

A total of 31 stop-gain/loss SNPs were genotyped in the validation set for replication of discovery set results, including 24 with a clinically relevant ICD9 code (P<1.0 × 10−4) and 7 other SNPs with biologically/clinically interesting PheWAS results (eg, rs2736911 in ARMS2). Replication consisted of an independent PheWAS for the 31 SNPs. At the phenotypic level, none of the top ICD9 codes were replicated in the independent analysis (Table 2). However, the ICD9 code defining wet AMD was consistent with the direction of the discovery set (ICD9 362.52, P=0.081; Figure 2b).

To better characterize the SNPs selected for validation, and because the discovery and validation sets represented unique populations based on age differences described previously (Supplementary Figures S1 and S2), a meta-analysis was conducted using a random effect model with pooling of studies by the Mantel–Haenszel method. As expected, the ICD9 codes with P<1.0 × 10−4 in the discovery set were not significant in the meta-analysis when discovery set P-value biases were considered. Conversely, the fifth most significant association from the PheWAS meta-analysis of rs2736911 was the ICD9 code that defines wet AMD (ICD9 362.52, P=0.0011, OR=0.69; Figure 2c). When considering other GWAS results,2 in combination with the results of the PheWAS described here, it may be more likely that ARMS2 is the candidate involved in wet AMD compared with other genes in the region.

In addition to the disease-specific meta-analyses for the associations identified in the discovery set, a meta-analysis across the phenome of all 31 SNPs genotyped in both the discovery and validation sets was assessed to generate new hypotheses. The most significant PheWAS meta-analysis was between rs1861050 and the ICD9 code that defines ‘encounter for other and unspecified procedures and aftercare’ (ICD9 V58, P=1.8 × 10−6). The importance of this association, and others reported (Supplementary Table S2), is unclear and will require further study.

Discussion

All genetic-based PheWASs published thus far have focused on genetic variants with known phenotypic associations driven primarily by GWAS results.1 The advantage of focusing on GWAS SNPs in PheWAS includes the potential to identify variants with pleiotropic effects, as emphasized by previously published results and associations observed with PheWAS control SNPs in this study. For example, nonsense SNP rs328 in LPL is not only associated with triglyceride levels, but may also be a risk factor for chronic cholecystitis. The use of GWAS SNPs in PheWAS capitalizes on known association data. Unfortunately, the majority of GWAS SNPs, except for the LPL example given above, are intergenic SNPs with unknown function, making translation of association results into biological insight a challenge. An alternative PheWAS approach may focus instead on known functional variants.

Like GWAS, PheWAS is a hypothesis-generating approach that is challenged by multiple comparison testing. The common approach to account for multiple testing in PheWAS is the use of a Bonferroni correction.1, 4, 6, 23, 24, 25 With 4841 phenotypes, 110 SNPs, and a study-wide α of 0.05, an association with P<9.4 × 10−8 would be required to identify a statistically significant association assuming independence. The only association that meets this criterion is that between the PheWAS control SNP rs1061170 in CFH and ICD9 codes that define AMD (Figure 1e). Although a Bonferroni threshold has been commonly applied in PheWASs, a Bonferroni correction may be overly conservative when correlations exist between ICD9 codes.1 Regardless, interesting associations were identified in this study that may provide insight into the genetic etiologies of complex conditions.

We demonstrated that stop-gain/loss variants may be used to identify important genes/polymorphisms involved in human disease. Furthermore, by focusing on functional variants, biological insights may be inferred. For example, several SNPs in the chromosomal region 10q26.13 have been associated with AMD, including wet AMD.2 This region contains multiple genes, including ARMS2 and HTRA1. One of the GWAS-significant SNPs associated with AMD is a missense SNP in ARMS2 (rs10490924). Because of LD across the region and the lack of biological insights for the genes and SNPs in the region, the importance of ARMS2 in AMD is uncertain. Our PheWAS results demonstrate that a nonsense SNP in ARMS2 (rs2736911), independent of the missense SNP, is associated with AMD (Figure 2c). The loss-of-function allele for rs2736911 had a protective effect (OR=0.69) that infers that the missense SNP rs10490924 could result in a gain of function. When considering previously reported GWASs, this PheWAS result strengthens the hypothesis that ARMS2 is involved in the pathophysiology of AMD and that two or more variants in ARMS2 may affect risk. Although unlikely, we cannot rule out the potential that these independent coding variants may be in LD with other functional variants outside of ARMS2.

The example of ARMS2 described above demonstrates that a variant with presumed function can be used to identify gene–disease associations by PheWAS. ARMS2 is one of 22 OMIM genes of focus in the present study, due in part to presumed clinical importance (Supplementary Table S1). However, many of the other SNPs in OMIM genes did not result in expected associations. The lack of associations could be the result of inherent differences between the discovery set and validation set. The discovery set was older than the validation set. As a result, the older discovery set had more cases for every control compared with the validation set (Supplementary Figure S2). This could indicate that a proportion of controls in the younger validation set may end up developing a disease at an older age. The impact of this is uncertain given that the number of controls is far greater than the number of cases for any given phenotype in a PheWAS.

The lack of associations identified for SNPs in OMIM genes could also be the result of incomplete phenomes, uncertainties in disease manifestations, and/or challenges in describing true function for any given stop-gain/loss SNP. For example, rs1861050 is a nonsense variant in CC2D2A. Mutations in CC2D2A are believed to cause COACH syndrome, although rs1861050 has not been implicated. COACH syndrome is a rare autosomal recessive disease related to Joubert syndrome.26 Based on PheWAS results, there were a total of 25 individuals in the discovery and validation set homozygous for the nonsense variant. No ICD9 codes indicative of COACH syndrome were associated with this SNP. This could be explained by a variety of reasons. First, the condition may be too rare to be captured within the phenome. Joubert syndrome, and several other congenital diseases, can be coded as ICD9 759.89. This code does not exist in our phenome. Because of privacy concerns, only those ICD9 codes that existed eight or more times in the population were analyzed.6 Second, COACH syndrome is frequently characterized by mental disabilities, potentially limiting the capacity of affected subjects to consent into PMRP. Alternatively, unaffected individuals homozygous for the nonsense SNP may be an indication of incomplete penetrance. Lastly, and perhaps most likely, challenges remain when interpreting function for some apparent nonsense SNPs.

Stop-gain/loss SNPs were selected because they fall in a class of variation that may be more likely to perturb function and result in a pathogenic effect. MacArthur et al27 were one of the first to systematically survey loss-of-function variants for disease associations. Using the Wellcome Trust Case Control Consortium, 417 loss-of-function variants were associated with seven complex phenotypes in nearly 16 000 patients. With a limited number of phenotypes, this association study was mostly negative, but was able to rediscover an association between a frame shift variant in NOD2 and Crohn’s disease. This study also highlighted challenges when predicting function in that nonsense SNPs are enriched toward the 3′ end of affected genes, suggesting that partial truncation is tolerable.27 Furthermore, SNP function may also be mitigated by unappreciated alternative splicing. In the CC2D2A example mentioned above, CC2D2A encodes three splice isoforms. Rs1861050 codes for a nonsense SNP in mRNA transcript NM_020785, but codes for a synonymous variant in the two other mRNA transcripts (NM_001080522 and NM_001164720). Disease-causing mutations identified in CC2D2A have been detected in the NM_001080522 transcript.28 Interestingly, this SNP has been reported to be associated with conduct disorder by GWAS, although the SNP did not reach GWAS significance (P=8 × 10−6).29 The ICD9 code for conduct disorder (ICD9 314) was modestly associated with the rs1861050 genotype in the meta-analysis (P=0.037). Further investigation into the importance of this SNP in conduct disorder is necessary.

As evident by the five PheWAS controls SNPs (Figure 1 and Table 1), and GWAS SNPs previously assessed by PheWAS,1 it is relatively straightforward to identify true associations when knowledge of expected associations is incorporated into the interpretation of PheWAS results. When there is little information on expected phenotypes, identifying true associations is a challenge. This is especially true when no association meets a conservative Bonferroni correction. Regardless, this PheWAS demonstrates the potential to identify gene–disease associations when focusing on functional variants. This study also highlights the limitations of PheWAS. For example, many conditions in a phenome may be rare. With a small number of cases, the power to detect an association may be difficult. Focusing on deleterious variants may increase power if these variants have higher effect sizes compared with other types of variation,10 such as those identified by GWAS. In the future, the challenge of small sample sizes may be exacerbated when ICD10 coding is implemented. Whereas ICD9 offers 17 000 possible codes, ICD10 has nearly 1 55 000 possible codes.30 The limitation of sample size is undoubtedly temporary. As biobanks continue to grow, and genetics/genomics is incorporated into standard medical practice, sample size may not be a limiting factor in future PheWASs. This expectation may open the door to PheWASs that focus on presumed functional variants of unknown significance identified by clinical next-generation sequencing.

As demonstrated by this and other reports, PheWAS has the capacity to rediscover SNP–disease associations when known phenotype data are incorporated into the interpretation of PheWAS results. This study also demonstrates that focusing on variants with presumed function and incorporating biological insights may be an effective PheWAS strategy to identify SNP–disease associations. Importantly, this approach may offer additional biological insights that GWAS SNPs tend not to provide as demonstrated by the ARMS2 nonsense SNP described above. As the genome and the phenome become better defined, PheWAS may provide an effective role when evaluating the use of human genetics for the application of individualized medicine.