Article Text


Original article
A bias-reducing pathway enrichment analysis of genome-wide association data confirmed association of the MHC region with schizophrenia
  1. Peilin Jia1,2,
  2. Lily Wang3,
  3. Ayman H Fanous4,5,6,
  4. Xiangning Chen4,
  5. Kenneth S Kendler4,
  6. The International Schizophrenia Consortium*,
  7. Zhongming Zhao1,2
  1. 1Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
  2. 2Department of Psychiatry, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
  3. 3Department of Biostatistics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
  4. 4Department of Psychiatry and Virginia Institute for Psychiatric and Behavior Genetics, Virginia Commonwealth University, Richmond, Virginia, USA
  5. 5Washington VA Medical Center, Washington, District of Columbia, USA
  6. 6Department of Psychiatry, Georgetown University School of Medicine, Washington, District of Columbia, USA
  1. Correspondence to Dr Zhongming Zhao, Department of Biomedical Informatics, Vanderbilt University School of Medicine, 2525 West End Avenue, Suite 600, Nashville, TN 37203, USA; zhongming.zhao{at}


Background After the recent successes of genome-wide association studies (GWAS), one key challenge is to identify genetic variants that might have a significant joint effect on complex diseases but have failed to be identified individually due to weak to moderate marginal effect. One popular and effective approach is gene set based analysis, which investigates the joint effect of multiple functionally related genes (eg, pathways). However, a typical gene set analysis method is biased towards long genes, a problem that is especially severe in psychiatric diseases.

Methods A novel approach was proposed, namely generalised additive model (GAM) for GWAS (gamGWAS), for gene set enrichment analysis of GWAS data, specifically adjusting the gene length bias or the number of single-nucleotide polymorphisms per gene. GAM is applied to estimate the probability of a gene to be selected as significant given its gene length, followed by weighted resampling and computation of empirical p values for the rank of pathways. We demonstrated gamGWAS in two schizophrenia GWAS datasets from the International Schizophrenia Consortium and the Genetic Association Information Network.

Results The gamGWAS results not only confirmed previous findings, but also highlighted several immune related pathways. Comparison with other methods indicated that gamGWAS could effectively reduce the correlation between pathway p values and its median gene length.

Conclusion gamGWAS can effectively relieve the long gene bias and generate reliable results for GWAS data analysis. It does not require genotype data or permutation of sample labels in the original GWAS data; thus, it is computationally efficient.

  • Schizophrenia
  • gene set enrichment analysis
  • GWAS
  • pathway
  • generalised additive model
  • cancer: lung
  • epigenetics
  • genetics
  • genome-wide
  • psychiatry

Statistics from


Recent successes of genome-wide association studies (GWAS) have identified hundreds of common variants susceptible to complex diseases.1 Although these newly discovered loci have greatly expanded our knowledge of diseases, their effects are generally moderate (eg, odds ratio (OR) for individual markers <2) and their functional impacts are unclear in most cases. An important research topic in the post-GWAS era is developing approaches to identifying genetic variants that might have a significant joint effect on complex diseases despite the individually weak to moderate marginal effect. Currently, gene set based analysis, which investigates the joint effect of multiple functionally related genes (eg, pathways), provides an alternative yet effective approach for discovering candidate genes and their pathways from the original GWAS datasets.2–5

Recent gene set analyses have successfully identified novel candidate genes and pathways for several psychiatric diseases such as schizophrenia4 and bipolar disorder.5 Nevertheless, a major problem in these applications, gene length bias, has not been well addressed. In a typical gene set based analysis of GWAS data, when mapping single-nucleotide polymorphisms (SNPs) to genes, long genes tend to cover more SNPs, and thus have a greater chance to harbour significant SNPs (eg, SNPs with nominal p values less than a certain cut-off value). This problem, which has impeded many methods from being applied widely, is especially severe in psychiatric diseases, as neurodevelopmental or brain related genes tend to be long (supplementary figure 1).6 In this context, when a pathway with long genes is found to be statistically significant, it is difficult to determine whether the pathway is indeed associated with the disease, or is selected due to the long gene effect.

Here, we have proposed a new strategy to adjust the gene length bias using only summarised GWAS data. We introduced the generalised additive model (GAM) to estimate the probability of a gene being selected as a significant gene as a function of its gene length and adjusted the bias on the pathway level by weighted resampling. We named this approach gamGWAS (generalised additive model for GWAS analysis). We have demonstrated gamGWAS in two well characterised GWAS datasets for schizophrenia, which have been shown to have a major signal located in the MHC region in the original studies.7 8 These datasets are unique for demonstration of pathway enrichment methods, as schizophrenia is a complex psychiatric disorder and its association signals are convergent, yet not limited to a few single markers. Accordingly, we would expect that a powerful pathway enrichment method could identify significant pathways consisting of a number of the MHC genes, as well as other pathways that could not be detected by traditional single marker analysis. As we report here, we have shown that by effectively reducing the correlation between significant pathways and gene length, gamGWAS could top rank several immune related pathways consisting of MHC genes, which confirmed the results from single marker analysis. Among those pathways, the cell adhesion molecules (CAMs) pathway is particularly interesting, as it has previously been reported as the only significant pathway for schizophrenia and bipolar disorder at the pathway level using the SNP ratio test method,4 rather than by single marker analysis. In this study, the CAM pathway was confirmed by gamGWAS. In addition to the confirmation of previous work, gamGWAS analysis also found several novel pathways, such as allograft rejection and autoimmune thyroid disease. These pathways are involved in the general immune system and are consistent with the MHC and CAMs findings, suggesting that changes in the immune system might be linked to schizophrenia. Finally, the gamGWAS may be applied to any other complex diseases, including psychiatric disorders.


GWAS datasets

The GWAS dataset from the International Schizophrenia Consortium (ISC) was genotyped using Affymetrix Genome-Wide Human SNP 5.0 and 6.0 arrays.8 The data were delivered to us after quality control of individuals and markers by the ISC analysis team. Its samples were collected from eight sites in Europe and the USA, and potential population stratification was adjusted by the Cochran–Mantel–Haenszel (CMH) test in the original study. Here, we followed the statistical analysis procedure in the original publication.8 Specifically, we used the CMH test implemented by PLINK (∼purcell/plink/)9 for the single marker association test. The second GWAS dataset we used in this study is from the Genetic Association Information Network (GAIN) for schizophrenia. The GAIN dataset was genotyped using the Affymetrix 6.0 array and was granted for our use by the GAIN Data Access Committee (DAC request #4532-2). We downloaded the data from NCBI dbGaP ( and performed quality control through the following criteria (see also Jia et al3): individuals with genotype rate <95% were removed and SNPs with genotype rate <95% or minor allele frequency (MAF) <0.05 were removed. According to the original study,7 there was no significant population stratification found in the GAIN samples. Thus, we used the Armitage trend test implemented in PLINK for the single marker association test. For both GWAS datasets, we computed the genomic inflation factor to indicate the quality of the association results.

We mapped a SNP to a human gene (NCBI 36.3) if it was located within the gene or 20 kb immediately upstream or downstream of the gene. We followed a widely used strategy to select SNPs to represent genes—that is, the SNP with the most significant association value was selected to represent the gene and its p value was assigned to the gene accordingly. For comparison, we also computed gene-wise p values using the VEGAS (VErsatile Gene-based Association Study) method.10 This method takes the linkage disequilibrium (LD) structures among SNPs related to a gene and combines multiple SNPs instead of using only the most significant one. We used the genotyping data of ISC and GAIN to compute population-specific LD structures and took all the SNPs mapped within 20 kb of each gene to compute the gene-wise p values.

We retrieved pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database11 that was included in the R package (28 February 2010). To avoid pathways with too narrowed or too generalised definitions of genes, we considered only those with the number of genes ≥5 and ≤250. This resulted in a total of 201 pathways for our subsequent analysis.

Correcting gene length bias via GAM

Our work was inspired by a recent study of RNA sequencing (RNA-Seq) data analysis by Young et al,12 who used a weighted resampling strategy to adjust for transcript length bias. In their work, GAM was applied to estimate the probability of a gene being selected as significantly differentially expressed genes as a function of its transcript length. Young et al fitted a monotonic cubic spline with six knots in their application and showed that it performed better in terms of model fit, when compared to other functional forms such as the generalised linear model.12 Here, we adapted this algorithm to adjust for gene length bias in the analysis of GWAS datasets. Our strategy is as follows.

First, the GWAS dataset was preprocessed and SNPs were mapped to genes. For each gene, a gene-wise p value was assigned as the most significant SNP p value among all SNPs located within the gene. Let gi be the gene-wise p value for the i-th gene and G = {gi; i =1,…,n} be the set of all gene-wise p values for a total of n genes in a GWAS dataset. Here, G forms a gene pool (ie, a set of gene scores) for the subsequent resampling (see details below).

Let Yi be an indicator variable for a gene being significant or not: Yi={10ifgi<Cotherwise, where C is a predefined cut-off value, for example C=0.05, or C=0.01. Genes with Y=1 are denoted as ‘SigGene(s)’.

Second, for weighted resampling, we estimated a weight for each gene based on its estimated probability of being a SigGene as a function of gene length from GAM. More specifically, in this work, we fitted the following logistic GAM model to the data: log{P(Yi=1)1P(Yi=1)}=α+f(gene lengthi) where α is the intercept and f(.) represents an unspecified smooth function. We used a monotonic cubic spline with six knots to estimate the functional form of f(.).

The estimated probability functions of SigGene given gene lengths are shown in supplementary figures 2 and 3. Based on this function, a weight vector for all the genes could be constructed as w = [w1,…,wn], where n is the number of genes, and wi is the predicted probability for gene i being SigGene based on the logistic GAM model above. That is, wi=P^(Yi=1) where ^ indicates an estimate. This can be implemented using the gam function in the R package mgcv.13 Specifically, we used the function gam(y ∼ s(x, k = nKnots, bs = “cr”), family = binomial()), where y denotes the proportion of SigGenes, x denotes gene length, nKnots is 6, and cr denotes cubic regression splines. As shown in supplementary figures 2 and 3, long genes have higher weights.

Third, we performed weighted resampling 10 000 times using the summarised GWAS data with p value for each gene. For each resample, m genes in the gene pool G = {gi}, i =1,…,n were selected without replacement according to their weights w and designated as SigGenes, where m is the observed number of SigGenes in the real GWAS sample. More specifically, for each resample, we first ordered genes according to ri1/wi, i =1,…,n, where ri[0,1] is random number which can be generated by most computer software such as R. After all genes from G were ordered, the top m genes with the largest values for ri1/wi were then assigned as SigGenes for the resample. Note that for longer genes with large values for wi, limwi1ri1/wi=ri, and for shorter genes with small values for wi, limwi0ri1/wi=0. Therefore, for each resample, long genes are more likely to be selected as SigGenes, and these random sets will have the same bias towards genes with longer gene length as the real sample. Finally, for each resample, we recorded the number of SigGenes from each pathway.

Note that in the standard randomisation procedure, only ri will be used to rank genes, based on the assumption that each gene has an equal chance to be selected. In contrast, in weighted resampling, we introduced wi to weigh the randomisation so that long genes with higher weights will be ranked on top of the list and thus more likely to be selected as SigGenes. This is a major improvement in our method compared with the standard randomisation procedure for the hypergeometric test.

Finally, for each pathway, let k be the observed number of SigGenes from the real GWAS sample, and K be the number of SigGenes in a resample. Then, an empirical p value was computed for the pathway using all resamples by the following equation:Pemp=Pr{Kk}=#resamples{Kk}+1total#of resamples+1(1)

Since in weighted resampling, the number of SigGenes in each pathway was compared to the distribution from gene-length-adjusted resamples (or adjusted by the number of SNPs depending on which one is being adjusted, see details below), the empirical p values were thus adjusted for gene length. We denoted this method as gamGWAS. The empirical p values were then adjusted for multiple testing using the method of Benjamini and Hochberg.14

Cluster of genes with overlapping SNPs

When using the most significant SNP to represent a gene, it is possible that a SNP is considered the most significant in multiple genes. This overlapping issue may introduce bias in pathway analysis. For example, if multiple genes share a most significant SNP, and those genes are in the same pathway, then this pathway would have much greater chance to be identified as being significantly enriched, even though the significance is in fact driven by the same signal being used multiple times.15

To overcome this challenge, for each pathway, when there were at least three closely located genes sharing the same gene-wise significant SNP, we clustered them into one cluster (figure 1). For example, the genes UGT1A1, UGT1A3, UGT1A4, UGT1A5, UGT1A6, UGT1A7, UGT1A8 and UGT1A9 were closely located on chromosome 2 and had the same most significant SNP, SNP_A-8705853, in the GAIN dataset. These genes function closely in the pathways such as hsa00500: starch and sucrose metabolism, and hsa00040: pentose and glucuronate interconversions. In our clustering step, we grouped these genes according to their physical locations on the chromosome and treated them as a gene-cluster (functionally comparable to a real gene) in the subsequent analysis. That is, we took the most left and most right physical positions of these genes as the start and end point of the newly defined ‘gene-cluster’ and assigned the SNPs located within the region to it. We treated the gene clusters in the same way as real genes and denoted each cluster as a single gene in the corresponding pathway. In this way, while the total number of genes was reduced in the original gene pool, the total number of genes in an affected pathway was also reduced accordingly to make the approach statistically appropriate. In our analysis of each GWAS dataset, we did a systematic search of gene clusters before we performed pathway analysis and resampling. We named this analysis strategy ‘gene-clustered method’. In the ISC dataset, 21 such gene-clusters were identified in 32 pathways, while in the GAIN dataset, 16 such gene-clusters were identified in 24 pathways.

Figure 1

Illustration of gene clustering process in the generalised additive model for genome-wide association studies. When three or more genes in a pathway share the most significant single-nucleotide polymorphism (SNP), they are clustered at both the gene level and the pathway level to reduce the overlapping problem.

Correction for the complex structure in the MHC region

Because the major signals of the two schizophrenia GWAS datasets are located in the major histocompatibility complex (MHC) region, genes in this region may introduce ‘repeat’ significance information in pathway enrichment analysis, especially when they are in strong LD. To reduce the potential repeat signals, we extended our clustering strategy above and clustered genes in the MHC region by applying the rules that not only genes sharing the exact same smallest SNP should be clustered, but also genes in high LD regions. We thus extracted 357 genes located on chromosome 6, ranging from 20 000 000 to 40 000 000 in the ISC dataset, and obtained the most significant SNP for each of these genes (214 unique SNPs). We computed pair-wise r2 values between any pair of the 214 unique SNPs using PLINK. SNPs with r2>0.2 were grouped together, and groups sharing SNPs were merged until no group could share any of the SNPs in the region. This resulted in 23 gene clusters. After this process, we had a gene pool of 19 627 records (including both genes and gene-clusters) for pathway analysis, in which genes outside of the MHC region were clustered if they shared the same most-significant SNP, while genes within the MHC region were further clustered if they also shared SNPs with r2>0.2. We named this strategy ‘gene- and LD-clustered in MHC’ in the subsequent analysis. The same strategy was applied to the GAIN GWAS dataset; we had a pool of 19 504 genes after clustering the MHC region.

Comparison with other methods

We compared two other methods that are applicable using the association summary data (p values of SNPs) in GWAS: the standard hypergeometric test and ALIGATOR.5 Note that although there are many methods for pathway enrichment analysis, most require raw genotyping data and are not comparable here. In the standard hypergeometric test, the underlying assumption is that each gene has an equal probability to be detected as significant; thus, SigGenes will follow a hypergeometric distribution, and, accordingly, the hypergeometric test (Fisher's exact test) can be applied. We performed resampling in the summarised GWAS data by selecting the same number of SigGenes for each resample, conducting the hypergeometric test, and computing an empirical p value using equation (1) for each pathway.

ALIGATOR uses a resampling based strategy using all SNPs on the chip. It also defines a list of SigGenes. For each resample, it keeps drawing SNPs randomly from the list until the number of the corresponding significant genes covered by these SNPs is the same as the observed number of significant genes. In our comparison, we computed an empirical p value using equation (1) to estimate the enrichment for each pathway. Because ALIGATOR resamples SNPs instead of genes, we could execute it on the SNP level. However, when we applied either the gene-clustered or LD-clustered strategy, both of which are on the gene level, we could not apply ALIGATOR.


The ISC dataset included 3322 cases and 3587 controls, with a genomic inflation factor of 1.09 using the CMH test. The total number of SNPs in this dataset is 739 995, covering 19 910 genes. After mapping to pathways, 4697 genes could be annotated with the KEGG pathway information. For the GAIN dataset, a total of 1158 cases and 1378 controls were included for our analysis after quality control. The genomic inflation factor for the GAIN dataset was 1.07 using the trend test. Similarly, a total of 19 739 genes were represented, 4659 of which had annotated information in the KEGG pathways.

We defined different studying scenarios depending on different adjustment factors (gene length or number of SNPs per gene), different cut-off values to define SigGenes (0.05 or 0.01) or different datasets (ISC or GAIN). This generated a total of eight scenarios for our implementation of gamGWAS (2 adjustments × 2 cut-off values × 2 datasets). In each scenario, the generalised additive model could generate a well fitted curve to estimate the proportion of significant genes versus either gene length or the number of SNPs per gene (supplementary figures 2 and 3). We listed the top 10 ranked pathways by the four methods (gamGWAS adjusting gene length, gamGWAS adjusting number of SNPs per gene, hypergeometric test and ALIGATOR) in tables 1 and 2 for ISC and GAIN datasets, respectively, while a full list of top pathways are provided in supplementary tables 1 and 2 for each scenario.

Table 1

Results of pathway enrichment analysis of the ISC dataset

Table 2

Results of pathway enrichment analysis of the GAIN dataset

gamGWAS effectively reduced the correlation between pathway p value and gene length

We first explored the correlation between the nominal p value and the median gene length of each pathway using Spearman's rank correlation coefficient (r). For each pathway, the median value of gene length was computed using the genes that are used in practice—that is, genes within the pathway and also represented by the corresponding GWAS dataset. Thus, the median value might be slightly different in the ISC and the GAIN datasets. As shown in figure 2, while the standard hypergeometric test showed a strong correlation between pathway p values and median gene length (−0.27 ∼ −0.31), gamGWAS substantially reduced this correlation to |r|<0.15 for ISC and to |r|<0.14 for GAIN. Note that the negative correlation in figure 2 indicates that pathways with longer genes had smaller p values. The r values from ALIGATOR were positive in all cases (eg, r>0.1) but varied greatly across datasets. For gamGWAS, the correlations were similar when we adjusted for gene length or adjusted for the number of SNPs per gene.

Figure 2

Correlation between pathway p values and median gene length. Four methods were tested for correlation in two schizophrenia genome-wide association studies (GWAS) datasets (International Schizophrenia Consortium (ISC) and Genetic Association Information Network (GAIN)) using two p value cut-offs (0.05 and 0.01) to define significant genes. In the generalised additive model for GWAS (gamGWAS), two adjusted factors were applied: by gene length (GL) and by number of SNPs in a gene (nSNP).

Overall characteristics of the enriched pathways

Overall, the significant pathways identified by the three methods overlapped greatly in the ISC or GAIN GWAS datasets, with gamGWAS generating a medium number of significant pathways (supplementary figure 4). The hypergeometric test produced the largest list of significant pathways, but our further examination found that many of them are pathways with long genes (right skewed in figure 3). Conversely, gamGWAS or ALIGATOR produced a smaller list of significant pathways and their average gene length was also much shorter, especially in those pathways identified by ALIGATOR. Therefore, the significant pathways found by the hypergeometric test were inconclusive considering the long gene bias. Overall, gamGWAS and ALIGATOR could effectively reduce the gene length bias and prevent pathways with many long genes from being identified as significant.

Figure 3

Distribution of significant pathways identified by the hypergeometric test (red), the generalised additive model for genome-wide association studies (gamGWAS) (blue: adjusting gene length; purple: adjusting number of single-nucleotide polymorphisms (SNPs)), and ALIGATOR (green). The background is the distribution of all KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways by their median gene length.

gamGWAS highlighted the immune system and confirmed the CAMs pathway

Interestingly, the common pathways identified by all three methods are uniquely related to the immune system (tables 1 and 2, supplementary tables 1 and 2), including the pathways of systemic lupus erythematosus (hsa05322), allograft rejection (hsa05330) and graft-versus-host disease (hsa05332) in ISC, and pathways of asthma (hsa05310) and primary immunodeficiency (hsa05340) in GAIN. These pathways belong to the category of ‘human disease → immune system diseases’ in the KEGG database. Although these pathways have no obvious functional link with schizophrenia, evidence of involvement of the immune system in schizophrenia has been emerging during the past several years. Importantly, when we specifically examined the genes involved in these pathways, we found many genes that are located in the MHC region. The MHC region is the major discovery from the combined analysis of the three largest GWAS for schizophrenia,7 8 16 including the ISC and GAIN datasets used in this study, based on single marker analysis. We further confirmed this finding on the pathway level. After successfully adjusting for the gene length bias, these pathways were prioritised from a long list of pathways tested.

It is important to note that the CAMs pathway (hsa04514) was frequently identified in 9 out of 16 scenarios (four methods by two datasets by two cut-off values, supplementary figure 4). The CAMs pathway was found to be statistically significant in a recent pathway-based analysis using the SNP ratio test4 on the same two GWAS datasets, as well as by the hypergeometric test. The results from gamGWAS further support the hypothesis that the CAMs pathway is associated with schizophrenia, for which gene length bias has been controlled. Interestingly, the CAMs pathway also includes a large number of genes from the MHC region (supplementary figure 5). We thus postulated that these MHC genes might be the driving factor in these pathways.

Results are robust after reducing the potentially correlated genes in the MHC region

To investigate whether the results were possibly inflated by the complex LD structure in the MHC region, we performed three additional analyses to validate these results. First, to reduce potential correlation between genes in the MHC region, we further clustered the genes in chromosome 6 in the range of 20 000 000 to 40 000 000 bp, where the MHC region locates, by grouping nearby genes if their representative SNPs had r2≥0.2. In this LD-clustering process, many significant SNPs located in the MHC region were reduced to one signal and many genes in the MHC region were reduced as gene clusters in pathway analysis. As expected, we observed a decrease of significance in pathways using this LD-clustered data, but some immune related pathways still stood out (supplementary tables 1 and 2).

Second, we computed gene-wise p values using the VEGAS method.10 The Pearson correlation coefficient between gene-wise p values computed by VEGAS and gene length was 0.0083 in the ISC dataset, compared to −0.2022 when using the most significant SNPs for each gene. Thus, the long-gene bias was well adjusted at the gene level using the VEGAS method. We then performed unweighted resampling using the VEGAS p values to identify top ranked pathways. Our application of this strategy to the ISC dataset showed that there were eight overlapped pathways among the top 10 ranked pathways identified by gamGWAS (adjusting either gene length or the number of SNPs per gene) and by the VEGAS-based strategy, indicating that gamGWAS indeed adjusted the gene-length bias at the pathway level and the LD structure did not substantially affect the results.

Finally, we generated 100 permutation GWAS datasets by swapping the case/control labels and then performed the pathway analysis for each permutation resample. For each resample set, we recorded the rank of each pathway. Then, we computed the average rank of each pathway over the 100 permutation datasets. The results did not show a substantial proportion of MHC-pathways or immune-related pathways to be frequently top ranked in the permutation data (supplementary tables 3–8).


We proposed the generalised additive model and a weighted resampling strategy in pathway enrichment analysis of GWAS datasets specifically by adjusting the gene length bias. Interestingly, our application to two schizophrenia GWAS datasets revealed several immune related pathways that were top ranked. These pathways included several MHC genes including HLA-DOA, HLA-DOB, HLA-A, HLA-B and HLA-C. Because the MHC region harbours most significant association variants conferring risk of schizophrenia according to traditional single-marker analysis of GWAS datasets,7 8 16 our results demonstrate that gamGWAS is efficient to detect a genetic signal using only the summary association information (p values of SNPs) in GWAS. While gamGWAS could confirm the signals reported in single marker analysis, this specific signal (MHC) was largely missed in previous pathway-based reports.3 4 Importantly, pathways like the CAMs were not reported in single marker analysis but could be detected in both a previous pathway-based study4 and gamGWAS, further implicating the power of gene set analysis. Additionally, gamGWAS found a few novel pathways, including allograft rejection and autoimmune thyroid disease, consistent with the previous MHC and CAMs findings. Finally, gamGWAS could remove many pathways that might be identified as significant by the hypergeometric test because of gene length biases (supplementary figure 4). Note that we could apply gamGWAS to a very complex disease (schizophrenia). Based on this demonstration, we expect that it can be similarly applied to other complex diseases.

Compared to the standard linear regression model, the generalised additive model allows non-linear estimation of the probability of a gene being selected as significant as a function of its gene length. This introduces more flexibility and better fitting. In our application, we required monotonic spline with six knots for the fitting. We examined the effects of more knots on the fitting curve, but we found little, if any, improvement by increasing the number of knots (data not shown). Thus, the parameters used in this report seem optimal for the problem. Of note, the GAM typically refers to multiple predictors, but our work only used one predictor (gene length or the number of SNPs). This GAM-based framework can be expanded in future by adding more predictors, such as the number of SNPs after considering LD and recombination hotspots.

Although our aim was to adjust gene length, we explored solving the biases caused by either gene length or the number of SNPs per gene. While gene length is fixed in each build of the gene annotation database, the number of SNPs per gene relies on different genotyping experiments (or genotyping designs), the strategy of mapping SNPs to genes (eg, 20 kb extension or 5 kb extension of gene region) or the local LD structure. Importantly, the number of SNPs per gene has fewer variables than gene length, because many genes may have the same number of SNPs in a study. For example, for the 19 840 genes in the ISC dataset (after clustering), there were 16 914 ‘unique’ values measured by gene length, but only 322 ‘unique’ values by the number of SNPs per gene. This will greatly reduce the accuracy of the overall fitting curve in GAM when the number of SNPs per gene is applied. Nevertheless, we attempted to adjust either gene length or the number of SNPs per gene, or both. The results showed less improvement when adjusting both. Thus, we suggest the user should adjust either of them in real application.

In a pathway based analysis of GWAS data, another important source of bias arises from the size of the pathway, for example the number of genes included in a pathway. To evaluate whether gamGWAS is robust to pathway size (or, more generally, gene set size), we computed the Pearson correlation coefficient between the p values of pathways and pathway sizes measured by the number of genes. As shown in supplementary table 9, the correlation coefficient tended to be small when using gamGWAS and ALIGATOR, especially gamGWAS, while moderate correlation was observed when using the hypergeometric test. Supplementary figure 6 shows the plots of the number of genes per pathway versus the pathway p values using the ISC GWAS dataset and cut-off p value for SigGenes 0.05. Overall, the bias caused by pathway size in gamGWAS analysis is weak.

One advantage of gamGWAS is that it does not require genotyping data. It can be executed using only the summarised GWAS results. This feature may substantially increase its usefulness, as the raw GWAS genotyping data is typically not publically available due to privacy concerns. Note that we used the original genotyping data in this report because we needed to specifically reduce the impact of the MHC region, a locus with previously reported signals in schizophrenia.

When comparing our gamGWAS with other methods, we selected two methods (hypergeometric test and ALIGATOR) that also do not require raw genotyping data, although we noticed that the hypergeometric test is known for disadvantages while ALIGATOR is more sophisticated. There are other methods, such as the GRASS method,17 that require raw genotyping data but are not affected by gene length biases. Our application of GRASS to a subset of the ISC dataset confirmed very weak correlation between the pathway p values of the GRASS method and the median gene length of pathways (data not shown). However, the implementation of the GRASS method requires genotyping data. For a large GWAS dataset like the ISC dataset, running GRASS requires a very large physical memory and CPU and is very computationally intensive.

Because of the previously reported signals in the MHC region in schizophrenia, we performed additional analyses to prove the effectiveness of the pathways identified by gamGWAS. The complex structure of the MHC region resulted in many genes sharing the same association signal. While we could use gene-cluster or LD-cluster, or both, to correct this problem, it also reduced the overall significance of most interesting pathways. Alternatively, we first adjusted the bias at the gene level using the VEGAS method and then performed a pathway enrichment analysis. The results were similar to our approach of adjusting at the pathway level, indicating that gamGWAS could effectively reduce the bias. Note that VEGAS relies on the original GWAS dataset, while only summary data is needed in gamGWAS. Furthermore, for estimating gene-wise p values, there are several other methods available, such as the rank truncated product (RTP) method,18 the adaptive RTP method (ARTP),19 and the truncated product method. We applied VEGAS in this study because it could use HapMap data to compute LD and does not require genotyping data. Alternatively, we performed pathway enrichment tests using the ARTP method. We observed a total of nine nominally significant pathways in a subset of ISC samples and 27 nominally significant pathways in the GAIN dataset, but none of them could survive after the multiple testing correction (data not shown).

In conclusion, we have proposed a new strategy, gamGWAS, to perform gene set enrichment analyses that specifically adjust for gene length bias. Our results not only confirmed previous findings but also highlighted several novel immune related pathways. gamGWAS does not require genotype data or permutation of sample labels in the original GWAS data. Therefore, it greatly reduces the computational burden and is easily applicable.


We would like to thank Dr Simon Wood for his invaluable help with using the R package mgcv and Dr Lin S Chen for her instruction on using the SNPath package (GRASS).

Appendix 1 Members of the International Schizophrenia Consortium

Trinity College Dublin—Derek W Morris, Colm T O'Dushlaine, Elaine Kenny, Emma M Quinn, Michael Gill, Aiden Corvin; Cardiff University—Michael C O'Donovan, George K Kirov, Nick J Craddock, Peter A Holmans, Nigel M Williams, Lucy Georgieva, Ivan Nikolov, N Norton, H Williams, Draga Toncheva, Vihra Milanova, Michael J Owen; Karolinska Institutet/University of North Carolina at Chapel Hill—Christina M Hultman, Paul Lichtenstein, Emma F Thelander, Patrick Sullivan; University College London—Andrew McQuillin, Khalid Choudhury, Susmita Datta, Jonathan Pimm, Srinivasa Thirumalai, Vinay Puri, Robert Krasucki, Jacob Lawrence, Digby Quested, Nicholas Bass, Hugh Gurling; University of Aberdeen—Caroline Crombie, Gillian Fraser, Soh Leh Kuan, Nicholas Walker, David St Clair; University of Edinburgh—Douglas HR Blackwood, Walter J Muir, Kevin A McGhee, Ben Pickard, Pat Malloy, Alan W Maclean, Margaret Van Beck; Queensland Institute of Medical Research—Naomi R Wray, Peter M Visscher, Stuart Macgregor; University of Southern California—Michele T Pato, Helena Medeiros, Frank Middleton, Celia Carvalho, Christopher Morley, Ayman Fanous, David Conti, James A Knowles, Carlos Paz Ferreira, Antonio Macedo, M Helena Azevedo, Carlos N Pato; Massachusetts General Hospital—Jennifer L Stone, Douglas M Ruderfer, Manuel AR Ferreira; Stanley Center for Psychiatric Research and Broad Institute of MIT and Harvard—Shaun M Purcell, Jennifer L Stone, Kimberly Chambert, Douglas M Ruderfer, Finny Kuruvilla, Stacey B Gabriel, Kristin Ardlie, Mark J Daly, Edward M Scolnick, Pamela Sklar.


View Abstract

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

    Files in this Data Supplement:


  • * Members of the consortium are listed in appendix 1.

  • Funding This work was partially supported by National Institutes of Health grant MH083094, 2009 NARSAD Maltz Investigator Award (to ZZ), and 2010 NARSAD Young Investigator Award (to PJ).

  • Competing interests None.

  • Ethics approval Not applicable because this study is not classified as human subjects research. For the GAIN dataset, the genotyping of samples was provided through the Genetic Association Information Network (GAIN) and obtained from the database of Genotype and Phenotype (dbGaP) found at through dbGaP accession number [phs000021.v2.p1] (data access request and approval #4532-2). For the ISC dataset, the genotyping of samples was approved by Institutional Review Boards in the original study (Purcell et al. Nature 2009;460:748–52).

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data sharing statement There is no additional unpublished data from this study. All the data used in this work were from previous publications.

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.