Background Over 160 disease phenotypes have been mapped to the major histocompatibility complex (MHC) region on chromosome 6 by genome-wide association study (GWAS), suggesting that the MHC region as a whole may be involved in the aetiology of many phenotypes, including unstudied diseases. The phenome-wide association study (PheWAS), a powerful and complementary approach to GWAS, has demonstrated its ability to discover and rediscover genetic associations. The objective of this study is to comprehensively investigate the MHC region by PheWAS to identify new phenotypes mapped to this genetically important region.
Methods In the current study, we systematically explored the MHC region using PheWAS to associate 2692 MHC-linked variants (minor allele frequency ≥0.01) with 6221 phenotypes in a cohort of 7481 subjects from the Marshfield Clinic Personalized Medicine Research Project.
Results Findings showed that expected associations previously identified by GWAS could be identified by PheWAS (eg, psoriasis, ankylosing spondylitis, type I diabetes and coeliac disease) with some having strong cross-phenotype associations potentially driven by pleiotropic effects. Importantly, novel associations with eight diseases not previously assessed by GWAS (eg, lichen planus) were also identified and replicated in an independent population. Many of these associated diseases appear to be immune-related disorders. Further assessment of these diseases in 16 484 Marshfield Clinic twins suggests that some of these diseases, including lichen planus, may have genetic aetiologies.
Conclusions These results demonstrate that the PheWAS approach is a powerful and novel method to discover SNP–disease associations, and is ideal when characterising cross-phenotype associations, and further emphasise the importance of the MHC region in human health and disease.
- phenome-wide association study (PheWAS)
- Genome-Wide Association Study (GWAS)
- major histocompatibility complex (MHC)
- Precision Medicine
Statistics from Altmetric.com
- phenome-wide association study (PheWAS)
- Genome-Wide Association Study (GWAS)
- major histocompatibility complex (MHC)
- Precision Medicine
Over the last decade, more than a thousand unique phenotypes have been associated with thousands of loci by genome-wide association study (GWAS).1 Interestingly, according to the National Human Genome Research Institute (NHGRI) GWAS Catalog, 2.5% of GWAS SNPs, some with potential pleiotropic properties and 13.5% of phenotypes can be mapped to a 4 Mb region on chromosome 6p, encompassing the major histocompatibility complex (MHC) gene cluster. For example, age-related macular degeneration, drug-induced liver injury and schizophrenia, along with many inflammatory and autoimmune conditions, such as ankylosing spondylitis, psoriasis and type I diabetes, are mapped to the MHC region.1 In addition to these GWASs, the Immunochip Consortium has fine-mapped approximately 20 autoimmune diseases to the MHC region,2 while others have studied this region by candidate gene association studies, gene expression studies and protein structural variant analyses.3–5
The MHC region is characterised by human leucocyte antigen (HLA) class I and class II gene clusters. HLA genes encode proteins that modulate both innate and adaptive immune response. HLA class I proteins present epitopes from inside the cell (eg, viruses) to identify cells targeted for cytotoxic T-cell digestion. Class I MHC molecules are presented as transmembrane glycoproteins that consist of two polypeptide chains, α and β2-microglobulins.6 Class II proteins present foreign antigens from outside the cell to stimulate helper T cells and B cells to activate the complement and antibody systems.7 ,8 Class II MHC molecules consist of α and β chains that are encoded by HLA-DP, HLA-DQ or HLA-DR.9 The MHC region is one of the most polymorphic regions in the human genome. Strong genetic associations between the HLA variants and autoimmune disease have been established for many years. For example, HLA-B27 has been known to be the major susceptibility gene for ankylosing spondylitis, a complex disease that is characterised by inflammation and ankylosis. Variants in this gene are present in over 90% of patients with ankylosing spondylitis. In another example, the major T1D susceptibility locus maps to the class II loci HLA-DRB1 and HLA-DQB1. Variants in this region may account for 30%–50% of genetic T1D risk.10 In addition to HLA genes, numerous other genes in the MHC region, such as TNF, MICA, MICB and MOG, also have apparent immunological roles.11–13 The potential importance of this genetic locus in disease aetiology is also highlighted by the significant proportion of conditions that are genetically mapped to these genes despite occupying a small proportion (only 0.1%) of the human genome.1 As such, the phenome-wide association study (PheWAS) design may be a powerful method to evaluate genetic variants in this important genetic region.
The PheWAS approach reverses the paradigm of GWAS by using a genotype-to-phenotype strategy to identify diseases that are associated with an individual genetic variant. A commonality of most PheWASs is the use of an electronic health record (EHR) to define a phenome, often relying on standardised International Classification of Diseases, V.9 (ICD9) codes to define the disease status.14 A challenge with PheWAS is that some phenotypes may be individually rare. Regardless, the PheWAS approach has demonstrated its capacity to rediscover important genetic associations identified previously by GWAS, has the capacity to identify novel associations and is ideal when characterising cross-phenotype associations.15–20 This may be particularly relevant for HLA variants located in the human MHC region on chromosome 6. For example, the first proof-of-concept PheWAS focused on the HLA DRB1*1501 variant previously associated with multiple sclerosis (MS). This PheWAS was able to demonstrate the importance of this variant in MS, and was able to identify a novel association with erythematosus conditions,15 an association that was subsequently confirmed in an independent PheWAS.17
With many phenotypes already mapped to the MHC region by GWAS, and with many more phenotypes yet to be studied by GWAS, we hypothesised that this region may contain additional genetic associations and that the PheWAS technique may be leveraged to identify such associations. To address this hypothesis, we conducted a comprehensive PheWAS of 2692 genetic variants across the MHC region spanning 4 Mbs on chromosome 6. PheWAS results confirmed many expected associations and identified many novel associations with immunological diseases not yet assessed by GWAS.
Materials and methods
This study was approved by the Marshfield Clinic's Institutional Review Board (approval number HEB10112). Written and informed consent was acquired for all participants.
Genotyped samples have been described elsewhere21 and have been applied previously to PheWAS.14 ,22 ,23 Briefly, all genotyped individuals were self-identified white/non-Hispanic Marshfield Clinic patients recruited into the Personalised Medicine Research Project (PMRP). PMRP represents a homogenous population with 77% of participants claiming German ancestry. In this PheWAS, 7481 patients were used for discovery and 3887 patients were used for independent validation. Discovery set participants were all over age 40 (mean 59 years), have on average over 30 years of EHR data and have been genotyped by Illumina HumanCoreExome BeadChip (San Diego, California, USA) as described below. Validation set participants were all over age 50 (mean 74 years) and had comparable years of EHR data.
In addition to PMRP, a cohort of 16 484 Marshfield Clinic twins was evaluated for disease concordance. Diseases studied in this population included eight phenotypes associated with the MHC locus in replication studies. Information used to identify twins included shared last name, date of birth, home address, healthcare/billing account and/or clinical documentation suggesting they were twins. Individuals in Marshfield Clinic Twin Cohort (MCTC) are on average 30 years of age with 8.8 years of EHR data. Although zygosity information is unavailable in MCTC, we developed a method that assesses disease concordance rates in twins to study potential genetic diseases as described previously.22 Significance was measured by determining if a disease co-occurred in pairs of twins more frequently than by chance, given the disease frequency in the cohort. Like the PheWAS, phenotypes were defined by ICD9 coding.
DNA samples from 7481 patients in the discovery set were genotyped by Illumina HumanCoreExome BeadChip. The exome chip consists of 569 645 variants across the genome. For the purpose of the current study, we analysed SNPs from the MHC region that spans chr6: 29091311-33821793 (hg19). After filtering out poor-quality and rare SNPs , 2692 variants remained with minor allele frequency (MAF) ≥0.01, including 142 previously defined (30 October 2014) ‘GWAS significant’ SNPs (p≤5.0E-08).1 For validation, 281 SNPs genotyped on the Illumina Human660W-Quad BeadChip were used.
ICD9 code and phenome
The phenome was defined by patient EHR data as described previously.17 ,23 ,24 Briefly, ICD9 codes were used to define cases and controls at varying levels of phenotypic resolution using a roll-up strategy (eg, ICD9 720.89→720.8*→720.*). Patients coded for any one specific code (eg, ICD9 720.89) became ‘cases’ for that code, whereas those not coded for the specific code or related codes (eg, ICD9 720*) became ‘controls’. For common ICD9 codes (≥300 individuals), cases were defined by those coded two or more times (‘rule-of-two’); those coded only once were not considered a case or a control. For less frequent ICD9 codes (<300 individuals), all individuals coded for that ICD9 code were designated as a case. As requested by Marshfield Clinic's Institutional Review Board, case status was not defined for rare ICD9 codes (<9 individuals) to protect patient privacy, as PMRP participants originate from a very specific region in Central Wisconsin. A total of 6221 phenotypes/ICD9 codes defined the phenome for this PheWAS.
Because some diseases represented rare phenotypes and some SNPs had low MAFs (<0.05), a Fisher's exact test for allelic association was calculated with Plink V.1.9 (http://pngu.mgh.harvard.edu/purcell/plink/).25 PheWAS was performed on 2692 SNPs including 142 ‘GWAS significant’ SNPs. A suggestive p value cut-off in the discovery set (p≤5E-05) was used to identify associations to be assessed for independent replication. We further applied logistic regression analysis using sex and years of EHR data as potential covariates for all 6221 phenotypes (see online supplementary tables S1 and S2). The rs1794275 genotype was also included as a potential covariate when further studying lichen planus. False discovery rate (FDR) was calculated.26 A meta-analysis was conducted using Fisher's method by Ri386 V.3.1.0 (R Development Core Team. R: a language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria. ISBN: 3-900051-07-0. 2011. http://www.R-project.org /) for SNP–disease pairs where phenotypic and genotypic data were available in the independent replication set. This analysis considered the direction of effect (ie, ORs) in the two datasets. HLA classical haplotypes were imputed using HIBAG R-package.27 ,28 Associations between classical HLA haplotypes and lichen planus were analysed using Fisher's exact test that compared each haplotype with all other haplotypes.
Supplementary table 1
Supplementary table 2
To characterise cross-phenotype associations in the MHC region, a trait-based association test (TATES) was conducted for 2692 SNPs, including 142 ‘GWAS significant’ SNPs. TATES analyses were conducted on ICD9 codes that define 12 broad disease categories and those that mapped to inflammatory phenotypes defined by PheCodes.29 For individual phenotypes, TATES combines the p values obtained in a single marker test to arrive at a global p value while correcting for observed correlational structure between phenotypes.20 ,30
PheWAS for ‘GWAS significant’ SNPs
Our initial PheWAS focused on identifying novel associations for GWAS SNPs that mapped to the MHC region [chr6:29091311-33821793 (hg19)]. According to the NHGRI GWAS Catalog, 259 SNPs had ‘GWAS significant’ associations (p≤5.0E-08);1 142 of these SNPs were genotyped by Illumina HumanCoreExome BeadChip. Of these 142 SNPs, 9 were associated with 10 ICD9 codes representing four general phenotypes that passed a conservative Bonferroni correction (p≤5.7E-08; assuming experimentwise α=0.05, 142 SNPs and 6221 phenotypes) in our discovery set (table 1). Three of the nine SNPs agreed with previous GWAS findings. For those that did not agree, most of the phenotypes were not adequately captured in the PheWAS, either because of rarity or lack of specific ICD9 codes that described the expected phenotype. Rediscovered phenotypes included reconfirming associations with coeliac disease, psoriasis and ankylosing spondylitis. For example, we replicated an association between rs2187668 and coeliac disease (p=1.9E-08, FDR=0.0072).32 ,33 Rs3131296 and rs2071278, previously shown to be associated with schizophrenia34 and levels of complement factors C3 and C4,35 respectively, and in partial linkage disequilibrium (LD) with rs2187668 in the PMRP sample (r2=0.53 and 0.45, respectively), were also associated with coeliac disease. Similar scenarios of multiple SNPs in LD sharing common associations were observed for other phenotypes (table 1). In addition to these rediscoveries, we also identified an association between rs1794275 and lichen planus (p=1.8E-08, FDR=0.0071). Lichen planus is an inflammatory condition that can affect skin and mucous membranes and has not been studied by GWAS. This SNP was previously shown to be associated by GWAS with IgA nephropathy in an East Asian population.36 Furthermore, this SNP has been associated with MS, primary biliary cirrhosis, rheumatoid arthritis and type I diabetes by Immunochip Consortium.2
PheWAS across entire MHC region
To more comprehensively assess the MHC locus, we conducted PheWAS for all HumanCoreExome SNPs mapped to this region (MAF ≥0.01, n=2692 SNPs). This analysis rediscovered statistically significant associations for psoriasis, ankylosing spondylitis and type I diabetes (p≤3.0E-09; assuming experimentwise α=0.05, 2692 SNPs and 6221 phenotypes). Using a suggestive p value threshold (p≤5.0E-05) in the discovery set, there were 1464 associations consisting of 895 SNPs and 425 phenotypes. Since many of these associations are not overly significant given FDR >0.05, we expect some false positives (see online supplementary table S1). Among these 1464 associations, 470 SNP–disease pairs, consisting of 281 SNPs and 214 phenotypes, could be assessed in the independent validation set with available data. Of the 470 pairs, 64 SNP–disease pairs had suggestive evidence in the validation set (p≤0.05), 58 SNP–disease pairs (91%), consisting of 44 SNPs and 23 phenotypes, demonstrating similar directions of effect in both datasets indicating potential enrichment for true associations (see online supplementary table S2). Among the 23 disease phenotypes, 8 diseases have not been characterised by previous GWAS. These eight diseases were associated with 16 SNPs (table 2). Under this scenario, lichen planus again showed associations with MHC SNPs, including the GWAS SNP described previously (rs1794275). As also mentioned previously and further described below, lichen planus is an immune-related condition. Based on clinical descriptions, many other conditions may also have immunological aetiologies (table 2).
To determine if any of the eight diseases may have underlying genetic aetiologies, 16 484 twins in the MCTC were assessed.22 The analysis considered the disease concordance rates in families of twins relative to the disease frequency in the population. In general, most diseases were rare in the twin cohort given that MCTC represents younger patients (average 30 years of age) with fewer years of EHR data (average 8.8 years) compared with patients in PMRP (average >59 years of age and >30 years of EHR data). Even with this limitation, three diseases had suggestive evidence (p<0.05) that these conditions cosegregated in families of twins and may be driven in part by genetics. With 46 affected and 1 family of disease-concordant twins, lichen planus approached significance (p=0.062) (see online supplementary table S3). The most significant phenotype included dyshidrosis, a skin condition that results in small fluid-filled blisters often affecting the hands and can be associated with atopic dermatitis and other allergic conditions.37 In MCTC, 133 twins were affected with dyshidrosis, including seven families of disease-concordant twin families (p=1.6E-6).
Supplementary table 3
Lichen planus associations in the MHC region
Lichen planus is known to be T cell-mediated and can be caused by MHC-linked graft-versus-host disease from allogenic bone marrow transplantation.38 Since lichen planus was the most significant phenotype identified in the current study, but not previously studied by GWAS, we conducted follow-up analyses. To further assess lichen planus associations in the MHC region, logistic regression was conducted in the discovery set. Except for the association with top ‘GWAS significant’ SNP rs1794275, lichen planus also showed associations with five other SNPs across the MHC region (p≤1.9E-5; assuming experimentwise α=0.05, 2692 SNPs) (figure 1A). All six SNPs were common variants with similar ORs (2.0–2.5) for the minor alleles. Four of the six associations were confirmed in the validation set (p<0.05) (table 3). However, LD analysis indicated that these SNPs were in partial or strong LD with the most significant SNP (rs1794275), indicating that these SNP associations were likely due to LD. Indeed, significance of these associations was diminished when the effect of rs1794275 was adjusted by logistic regression (figure 1B). For follow-up, and to provide potential functional insights, we assessed associations between classical HLA haplotypes and lichen planus. Haplotype HLA DQB1*05:01 had the strongest association with lichen planus (p=8.0E-08). Similar haplotype analyses were conducted for the seven other novel phenotypes (table 2 and see online supplementary table S4).
Supplementary table 4
Cross-phenotype association analysis of the MHC region
Because many disease-associated variants have been mapped to the MHC region by GWAS and now PheWAS, we characterised cross-phenotype associations for 2692 HumanCoreExome BeadChip MHC SNPs. Focus was on 358 ICD9 codes that define inflammatory phenotypes.29 Five SNPs were statistically significant for cross-phenotype associations (p≤1.9E-05; assuming experimentwise α=0.05, 2692 SNPs), including rs4349859, rs4418214, rs9391846, rs12175489 and rs2844505 (figure 2). The top SNP with the most significant cross-phenotype associations was the GWAS SNP rs4349859 (p=2.3E-14). Rs4349859 was most strongly associated with ankylosing spondylitis, as described previously (ICD9 720; p=8.7E-17; table 1), along with multiple arthritic phenotypes including rheumatoid arthritis (ICD9 714; p=0.004), unspecified polyarthropathy or polyarthritis of ankle and foot (ICD9 716.57; p=0.005) and localised osteoarthrosis of the pelvic region and thigh (ICD9 715.35; p=0.008) (see online supplementary table S5). These results are not surprising given the observed genetic overlap between ankylosing spondylitis and rheumatoid arthritis.39 Finally, rs4349859 was also associated with iridocyclitis (ICD9 364.02; p=8.0E-06) (see online supplementary table S5), an inflammatory condition of the iris and ciliary body that can affect those diagnosed with ankylosing spondylitis.40 These significant results generated in a single experiment expand on observations of cross-phenotype associations for the MHC region described by many GWASs.
Supplementary table 5
Our study is the first comprehensive PheWAS to examine SNPs mapped to the MHC region. As described previously, the MHC region may be involved in the aetiology of many diseases, including unstudied conditions. Diseases such as autoimmune, inflammatory and malignant diseases are significantly more common among individuals carrying particular HLA alleles.10 Many of the genes in the MHC region are hypermutable, which is fundamental for their function. This is particularly relevant for HLA genes involved in the induction and regulation of the immune responses. As such, the PheWAS may serve as a powerful tool for studying genetic variants in this important genetic locus. Our study confirmed significant associations between SNPs and disease phenotypes identified in previous GWASs,1 including ankylosing spondylitis, psoriasis, coeliac disease and type I diabetes, some with strong cross-phenotype associations that may be indicative of pleiotropy. Most importantly, we further demonstrated PheWAS' capacity to discover suggestive SNP-disease associations for eight diseases that have not been studied by GWAS. Multiple diseases, including lichen planus, are likely immune-related diseases, reflecting the importance of the MHC region in human immune genetics (see online supplementary table S2). Results from MCTC provide further support that these conditions may be influenced by genetic variation, although we cannot rule out shared environmental effects (see online supplementary table S3). Additional genetic studies focused on these diseases may be needed.
Lichen planus was the most significant disease associated with MHC SNPs not previously studied by GWAS.1 Our study revealed six SNPs, including one GWAS significant SNP (rs1794275), associated with lichen planus across a 400 kb region, including HLA-DQB1 (figure 1 and table 3). These six SNPs have also been reported to be associated with MS, type I diabetes and other immune diseases.2 Interestingly, there is evidence that individuals with MS and type I diabetes are at increased risk for lichen planus.41 ,42 Since many of these variants are in partial LD, we suspect that these SNPs probably tag for the same functional variant or haplotype. Haplotype analysis demonstrated HLA-DQB1*05:01 was strongly associated with lichen planus. Notably, HLA-DQB1*05:01:01 has been implicated in lichen planus previously by a candidate gene association study.43 However, other candidate gene studies have implicated other HLA haplotypes in lichen planus.44 ,45 Since individuals with lichen planus are at increased risk for carcinoma,46 in-depth future studies of the genetic aetiology of lichen planus may be warranted.
Like GWAS, PheWAS is challenged by multiple comparison testing and frequently applies a Bonferroni adjustment.14 This study also included variants with MAF ≥0.01, including coding variants, as potential candidates. Variants with MAF <0.05 may have higher effect sizes compared with common variants, but if not, they may increase the burden to identify statistically significant associations. The ability to identify statistical significance is further limited by the inherent nature of the phenotype(s) being studied (eg, heritability, polygenicity, case/control specificity and sample size). In a PheWAS strategy, thousands of phenotypes can be studied simultaneously, but some individual diseases may be rare, have weak genetic aetiologies or be sex-specific. The culmination of these challenges will influence power to detect associations. We attempted to account for these difficulties by applying conservative Fisher's exact tests and Bonferroni adjustments when interpreting results. It is expected that larger cohorts linked to extensive phenotypic data, such as the anticipated ‘precision medicine initiative’ of over one million individuals,47 will be ideal for PheWAS to better assess the importance of thousands of phenotypes including rare and sex-specific diseases.
Unique to PheWAS is the potential correlative structure in the phenotypic data. In an ICD9-based PheWAS, similar ICD9 codes may be correlated. Furthermore, correlations exist across codes.14 Under this circumstance, a stringent Bonferroni correction may be overly conservative. Regardless, we did identify multiple associations that surpassed the conservative multiple testing threshold, but we suspect that additional novel associations remain to be elucidated by follow-up PheWASs and disease-specific studies.
The MHC region brings unique challenges in genetic study not limited to PheWAS. Specifically, this locus includes extreme sequence diversity, structural variants, high gene density with considerable homologies and substantial LD spanning Mbs of sequence.48 Due to these inherent qualities, characterising the candidate genes and their presumed functional variants can be a significant challenge. These limitations also restrict SNP genotyping array design with many genotyped SNPs, unable to fully address these complexities. It should be noted that other SNP arrays, including Illumina ImmunoChip,2 provide higher density genotyping for this region compared with the Exome Chip platform. In the future, new long-range sequencing technologies,49 in combination with the PheWAS approach, may prove necessary to comprehensively study this biologically and clinically important region.
In conclusion, our results expand on what multiple GWASs have identified in that the MHC locus is an important region in human health and disease. Furthermore, this study builds on the growing evidence that the PheWAS technique may be a highly efficient method to detect novel SNP-disease associations and may be ideal when characterising cross-phenotype associations (figure 2). With patient cohorts expanding into the millions linked to extensive phenotypic data, it will be conceivable to conduct a ‘GWAS-by-PheWAS’ in a single experiment where all diseases are studied at the GWAS level and all variants are studied at the PheWAS level. Such strategies may lead to significant advancements in precision medicine.
The authors would like to thank Rachel Stankowski for her assistance in editing this manuscript and Dr Joshua Denny for providing the most recent version of the PheWAS R-package.
Contributors SJH and JL designed the study, and SJH oversaw all aspects of the study. JL performed laboratory experiments. JL, JGM, BAH and ZY generated and analysed association data. CG, LR and CC provided clinical interpretations. S-SK, XZ, TM and KT provided expertise in haplotype analysis. MHB and SJH provided material support. JL and SJH wrote the manuscript with input from all other authors.
Funding This work was supported by the generous donors of the Marshfield Clinic, National Institutes of Health National Center for Advancing Translations Sciences grant number UL1TR000427, National Human Genome Research Institute grant number 1U01HG006389, National Institute of General Medical Science grant number 1R01GM114128 and National Library of Medicine grant number 1K22LM011938.
Competing interests None declared.
Patient consent Obtained.
Ethics approval Marshfield Clinic IRB.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement All relevant data are provided in supplemental material. According to Marshfield Clinic IRB, and as approved by USA National Institute of Health, individual-level medical record data cannot be freely shared, to protect patient privacy.
If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.