Introduction

Variants affecting risk of disease may be individually too rare to generate statistically significant results in caseā€“control studies and so a burden test may be performed to assess whether there is an excess among cases of particular categories of variant within a gene or set of genes, the variants typically being included based on rarity and predicted function [1, 2]. Rather than exclude less rare variants, one may perform a weighted analysis in which common variants are included but are accorded less weight less than rare ones [3]. This approach assumes that the variants studied in general increase rather than decrease risk of disease and is contrasted to approaches such as c-alpha, SKAT and SKAT-O, which assume that variants may be either protective or deleterious [4,5,6,7]. These and other approaches to rare variant analysis have been reviewed elsewhere [8]. We have previously developed a method that provides weights for variants based both on rarity and predicted functional effect [9, 10]. The method performs a weighted burden analysis to test whether, in a particular gene or set of genes, variants that are rarer and/or predicted to have more severe functional effects occur more commonly in cases than controls. Each variant is assigned a weight based on its rarity and predicted function and then an overall gene-wise risk score is allocated to each subject consisting simply of the sum of the weights of the variants, which are found in that subject. These scores for cases and controls are then compared by carrying out a t-test. The t-test is rapid to compute and statistically robust, and the method performed acceptably when applied to real samples. However, a t-test does not incorporate information from covariates and there are two reasons why this is an important limitation. The first reason is that if the samples are ancestrally heterogeneous then artefactual results can be obtained, and that using ancestry principal components as covariates might mitigate this problem. The second reason is that other measurable genetic and non-genetic factors might be known to contribute to risk of disease, and that incorporating these as covariates might be expected to enhance the accuracy of the analysis. Three obvious kinds of genetic risk to consider are the polygenic risk score (PRS), the presence of known pathogenic copy number variants (CNVs) and the presence of known pathogenic sequence variants. To use schizophrenia as a concrete example, there is evidence that certain CNVs greatly increase risk but these are very rare, even among cases [11]. Likewise, very rare variants causing loss of function (LOF) of a small number of genes substantially increase risk [12, 13]. Statistical evidence demonstrates that rare, damaging variants in additional genes also affect risk, and that these genes are concentrated in particular gene sets, although the individual genes contributing to this effect are yet to be identified [14, 15]. Finally, cases tend to have a higher PRS, reflecting the combined effect of many common variants, widely distributed and individually having very small effects on risk [16]. When carrying out a caseā€“control study of exome sequence data in order to detect associations with rare, damaging variants, it might be reasonable to suppose that cases with pathogenic CNVs or sequence variants might be unlikely to possess additional rare risk factors. Likewise, a case with a very low PRS might be thought more likely to possess some additional risk factor than one whose PRS is very high. Thus, it seems desirable to incorporate information about different kinds of risk factor jointly when possible. Accordingly, software was developed which would compare the gene-wise risk scores using logistic regression analysis so that any desired covariates could be included.

Methods

The previously described SCOREASSOC programme was modified to carry out logistic regression analysis [9, 10]. It accepts as input genotypes of variants within a gene for cases and controls, with each variant assigned a weight according to its annotation as obtained using VEP, PolyPhen and SIFT [17,18,19]. The functional weight is then multiplied by a weight for rarity, so that rarer variants are assigned higher weights. For each subject, a gene-wise risk score is derived as the sum of the variant-wise weights, each multiplied by the number of alleles of the variant which the given subject possesses. If a subject is not genotyped for a variant then they are assigned the subject-wise average score for that variant. The programme was modified to accept as additional input an arbitrary number of quantitative covariates for each subject, typically population principal components, PRS and an indicator variable denoting whether or not the subject possesses a known pathogenic CNV or sequence variant. The score and covariates are entered into a standard logistic regression model with caseā€“control status as the outcome variable and after variable normalization the likelihood of the model is maximized using the Limited-memory Broydenā€“Fletcherā€“Goldfarbā€“Shannon (L-BGFS) quasi-Newton method, implemented using the dlib library [20]. The contribution of different variables to risk is assessed using standard likelihood ratio tests by comparing twice the difference in maximized log likelihoods between models with and without the variables of interest. This likelihood ratio statistic is then taken as a chi-squared statistic with degrees of freedom equal to the difference between models in number of variables fitted. The coefficients for each variable can be varied to maximize the likelihood or can be fixed. For example, if it is known that a particular CNV is associated with a tenfold increase in risk then the coefficient can be set to ln(10) to reflect this, rather than fitting it from the available dataset, which may contain fewer subjects than those used to produce the original risk estimate.

Preliminary analyses indicated that simple logistic regression could produce extreme p-values if a gene had only a single very rare variant found in only one or two cases. This seems to arise because subjects with unknown genotypes are then assigned the average score for this variant. This average score is very low but non-zero, whereas subjects who are genotyped will have scores of zero. If more cases than controls have an unknown genotype then the maximization routine would overfit the model and would assign a very high value to the score coefficient and would produce an apparently significant likelihood ratio statistic. To address this, the ridge penalty function, consisting of the sum of the squares of the regression coefficients, was subtracted from the log likelihood. This satisfactorily prevented the artefactual extreme p-values without preventing the ability to fit the model to produce the expected large coefficients for covariates such as principal components and the PRS.

The programme outputs the coefficients for the fitted models along with their estimated standard errors and the results of the likelihood ratio test. When association with the gene-wise risk score alone is tested, i.e. when the two models differ only in whether or not the score is included, then the statistical significance is summarized as a signed log p-value (SLP), which is the log base 10 of the p-value given a positive sign if the score tends to be higher in cases and negative if it tends to be lower. For other analyses, the minus log base 10 of the p-value (MLP) is the output. The support programme for SCOREASSOC, called GENEVARASSOC, was also modified to facilitate incorporating the covariates and specifying the desired analyses. Both are implemented in Cā€‰+ā€‰+ and can be downloaded along with documentation from the site listed below.

Example application

The approach was applied to whole exome sequence data from the Swedish schizophrenia study, consisting of 4968 cases and 6245 controls [14]. The sequence data was downloaded as a Variant Call Format (VCF) file from dbGAP (https://www.ncbi.nlm.nih.gov/gap). One aspect of special interest about this dataset is that although it was recruited in Sweden some subjects have a substantial Finnish component to their ancestry and that this applies more to cases than controls. It was analysed previously using SCOREASSOC to carry out a weighted burden test, but to do this it was first necessary to remove the subjects with Finnish ancestry because otherwise some genes produced false positive results [15]. To obtain the gene-wise risk scores the same methods were used as for this previous analysis. Variants were excluded if they did not have a PASS in the VCF information field and individual genotype calls were excluded if they had a quality score <ā€‰30. Sites were also excluded if there were >ā€‰10% of genotypes missing or of low quality in either cases or controls or if the heterozygote count was smaller than both homozygote counts in both cohorts. Each variant was annotated using VEP, PolyPhen and SIFT [17,18,19]. GENEVARASSOC was used to generate the input files for SCOREASSOC and the default weights provided with the software were used, e.g., consisting of 5 for a synonymous variant and 20 for a stop gained variant, except that 10 was added to the weight if the PolyPhen annotation was possibly or probably damaging and also if the SIFT annotation was deleterious. The full set of weights used is shown in TableĀ 1. SCOREASSOC also weights rare variants more highly than common ones but because it is well-established that no common variants have a large effect on the risk of schizophrenia we excluded variants with minor allele frequency (MAF)ā€‰>ā€‰0.01 in the cases and in the controls, so in practice weighting by rarity had negligible effect.

Table 1 The table shows the weight that was allocated to each type of variant according to its annotation by VEP [17]

To obtain population principal components, the genotypes were thinned to include only variants present on the Illumina Infinium OmniExpress-24 v1.2 BeadChip (http://emea.support.illumina.com/downloads/infinium-omniexpress-24-v1 ā€“ 2-product-files.html) and then version 1.90beta of plink (https://www.cog-genomics.org/plink2) was run with the options --pca header tabs --make-rel [21,22,23]. In order to obtain a PRS for schizophrenia, the file called scz2.prs.txt.gz, containing ORs and p-values for 102,636 single-nucleotide polymorphisms (SNPs), was downloaded from the Psychiatric Genetics Consortium (PGC) website (www.med.unc.edu/pgc/results-and-downloads). This training set was produced as part of the previously reported PGC2 schizophrenia genome-wide asoociation studies [16]. SNPs with p-valueā€‰<ā€‰0.05 were selected and their log(OR) summed over sample genotypes using the --score function of plink 1.09beta, in order to produce a PRS for each subject [21,22,23].

Attempts were made to use allele depth information in the VCF file to call all the CNVs with odds ratio (OR) reported to be >ā€‰9 as listed in TableĀ 1 of a recent study of over 40,000 subjects [11]. A deletion was called if there was a relatively low read depth over the region (compared with the expected depth across subjects for each variant and across variants for each subject) and if there were very few heterozygote calls. A duplication was called if there was a relatively high number of reads and if the heterozygote calls tended to occur with allele ratios of 2:1 or 1:2 instead of 1:1. To accomplish this, a two-stage process was used, consisting of an automated short-listing of subjects followed by visual inspection of detailed results for these shortlisted subjects. To carry out the short-listing process, for each variant a likelihood for the observed allele depths was calculated according to the copy number being 0 (deletion), 1 or 2 (duplication) and from this a log likelihood ratio statistic (LRS) favouring either deletion or duplication was derived. Then for each subject, a t-test was used to compare the LRS of variants within the CNV region to those outside it and another t-test was used to compare the overall depth of variants inside and outside the CNV region. The shortlisted subjects consisted of the 30 subjects having test results most strongly, suggesting a deletion or duplication, along with 30 subjects having the lowest or highest average difference in depths, producing a possible maximum of 120 subjects. For these subjects, graphs were produced showing the LRS and moving average LRS along with the allele ratios and moving average depths. CNV calls were made blind to phenotype based on visual inspection of these graphs and a judgement as to whether the overall pattern seemed consistent with the presence of a deletion or duplication. It can be difficult to detect CNVs from exome-sequence VCF files, because some regions will not be covered at all and because depth information is only provided for positions where a variant allele is observed. There was insufficient information to call the CNV at 9:831690ā€“959090 and no subjects were called with a CNV at 3:197230000ā€“198840000, although the latter is very rare and it may be that it was not present. The number of calls that were made in cases and controls is shown in TableĀ 2. Although there is a marked excess of CNV calls among cases at 16:29560000ā€“30110000 and 22:17400000ā€“19750000, this is not the case for the other locations and it assumed that many of these calls may be erroneous. The intent of the current study is simply to demonstrate the feasibility of the analytic approach, so errors in the CNV calls are not regarded as especially problematic. However, because of the unreliability of the calls it was decided not impose a fixed effect for the CNVs but to fit the effect size as observed in this dataset. For brevity, these CNVs having large effects of risk, along with LOF sequence variants having large effects on risk, will be referred to as pathogenic.

Table 2 Counts of CNV calls in cases and controls for those CNVs with ORā€‰>ā€‰9 in TableĀ 1 from a recent study of CNVs in schizophrenia [11]

Based on previous findings, it was decided to regard LOF sequence variants in SETD1A, RBM12 or NRXN1 as pathogenic [11,12,13, 24, 25]. No subjects had a LOF variant in NRXN1 or RBM12. However, two cases had LOF variants in SETD1A, with one having a stop gained variant and the other a splice acceptor variant.

Likelihood ratio tests were carried out to assess the association of the principal components with caseness and then, using the principal components as covariates, to assess the association of the PRS and of the pathogenic CNVs and sequence variants.

For each gene, the following analyses were carried out to test the association of the gene-wise risk score with caseness:

A t-test comparing scores in cases and controls.

Logistic regression analysis of the scores with no covariates.

Analysis of the scores including principal components as covariates.

Analysis of the scores including principal components and PRS as covariates.

Analysis of the scores including principal components and pathogenic variants as covariates.

Analysis of the scores including principal components, PRS and pathogenic variants as covariates.

In order to test whether particular sets of genes demonstrated association with caseness, the gene-wise scores for all the genes in a set were summed to produce a set-wise risk score using the PATHWAYASSOC programme [10]. Then the same analyses as listed above were carried out using the set-wise risk scores. Two lists of gene sets were used. The first list consisted of the 31 gene sets which had been previously tested for an enrichment for damaging or disrupting ultrarare variants in cases in this dataset [14]. The second list consisted of the 1454 ā€˜all Gene Ontology (GO) gene sets, gene symbolsā€™ pathways downloaded from the Molecular Signatures Database at http://www.broadinstitute.org/gsea/msigdb/collections.jsp [26].

Results were managed and graphs produced using R [27].

Results

Carrying out the logistic regression analyses is notably slower than performing a t-test and to carry out all the analyses takes a few minutes for each gene, so that analyses of all 22,021 genes took a couple of days to complete on a computer cluster.

The likelihood ratio test comparing models with or without the first 20 population principal components showed that ancestry was strongly associated with caseness (Ļ‡2ā€‰=ā€‰374, 20 df, MLPā€‰>ā€‰35). Using these principal components as covariates, both the PRS (Ļ‡2ā€‰=ā€‰156, 1 df, MLPā€‰=ā€‰35) and the presence of a pathogenic CNV or sequence variant (Ļ‡2ā€‰=ā€‰39.6, 8df, MLPā€‰=ā€‰5.4) were also associated with caseness.

Fig.Ā 1 shows the QQ plots obtained for the different gene-wise analyses. The simple t-test comparing gene-wise risk scores has a clear excess of positive SLPs above the chance expectation. The most extreme of these is for the gene for catecholamine methyl transferase, COMT, SLPā€‰=ā€‰7.4. As discussed previously, this gene-wise result is largely driven by SNP rs6267, which is heterozygous in 51/6242 controls and 94/4962 cases (ORā€‰=ā€‰2.3, pā€‰=ā€‰8ā€‰Ć—ā€‰10āˆ’7) and this reflects the fact that variant is much commoner in Finns than in non-Finnish Europeans [15]. One gene, CDCA8, has an extremely negative result and this reflects the fact that more controls than cases have a number of very rare variants with high functional weights. The results for the logistic regression analysis of the gene-wise risk scores alone, shown in Fig.Ā 1b, are very similar to those obtained from the t-test. However, when the population principal components are included, as shown in Fig.Ā 1c, the results conform almost exactly to what would be expected by chance. The SLP for CDCA8 remains somewhat low at āˆ’ā€‰5.6 but this does not exceed the threshold for significance using a Bonferroni correction for the number of genes tested. Including the PRS, CNVs and SETD1A variants as covariates, as shown in Fig.Ā 1d-f, does not have a large impact on the overall distribution of the SLPs.

Fig. 1
figure 1

QQ plots of SLPs obtained for 22,021 genes using different methods of analysis. a Results from t-tests comparing gene-wise risk scores between cases and controls. Results for Figs.Ā 1B to 1F use logistic regression analysis of gene-wise risk scores with caseness as outcome. Analyses include the following covariates. b No covariates. c Twenty population principal components. d Twenty population principal components and PRS. e Twenty population principal components and pathogenic CNV or sequence variants. f Twenty population principal components, PRS and pathogenic CNV or sequence variants

Examining the SLPs for individual genes showed that, as can be seen from the QQ plots, including the population principal components as covariates could have a large impact. The largest effect was seen with COMT, where the SLP was reduced by 4.7, and there were an additional 6 genes for which the absolute value of the SLP reduced by >ā€‰3, and in all there were 59 genes where the magnitude of the SLP reduced by 2 or more. For all but two of these, the SLP was positive. There were other genes where the magnitude of the SLP was increased, although to a lesser extent. For only one gene, CRISP3, did this change exceed 2 and there were 24 others for which the change was 1 or greater. All these genes had a negative SLP. Thus, when including the principal components had a large effect, the effect was generally to make positive SLPs smaller and negative SLPs larger. Although there were large effects for some genes, overall the effects were fairly small and evenly balanced, so that across all genes the mean SLP was reduced by 0.088 and the mean of the absolute value of the SLP was reduced by 0.059.

Including the PRS as a covariate had only small effects on the results. The largest change in SLP was only 0.6. Considering those genes that had the largest change in the absolute value of the SLP, there was a fairly equal distribution between those with positive and negative SLPs. The average effects across all genes were almost perfectly balanced, with the mean SLP changing by only āˆ’ā€‰0.00078 and the mean absolute value of the SLP changing by 0.00016.

Including the pathogenic variants as covariates also had only small effects. For two genes, the negative SLPs decreased by 0.95 and 0.87 but in no other gene did the SLP change by >0.51. In spite of the small magnitude of the changes, there was a very striking imbalance in the way they were distributed. Out of the 200 genes with the largest increase in the absolute value of the SLP 197 had a negative SLP, whereas out of the 200 genes with largest decrease in the absolute value of the SLP 199 had a positive SLP. Thus, in the analyses for which including the pathogenic variants had the largest effect, doing so tended to reduce positive SLPs and increase negative SLPs. This imbalance was only notable for the analyses with the largest effects and overall the average effect was small so that including the pathogenic CNVs changed the mean SLP by 0.00018 and the mean absolute value of the SLP by 0.0013.

As in previous studies of this dataset, none of the individual gene-wise tests is statistically significant at the genome-wide level. TableĀ 3 shows genes with SLPā€‰ā‰„ā€‰3 from the analysis including all covariates.

Table 3 Table showing gene-wise risk score SLPs for genes achieving SLPā€‰ā‰„ā€‰3 in the analysis including all covariates

The results of the gene set analyses applied to the sets used in the previous analyses of this dataset are shown in TableĀ 4. It can be seen that many sets have a high SLP before including the PCs as covariates, demonstrating that one may be at risk of obtaining false positive results if this is not done. When all the covariates are included, the highest SLP achieved is for the set consisting of fragile X mental retardation protein (FMRP) targets, with SLPā€‰=ā€‰3.3, equivalent to pā€‰=ā€‰0.0005.

Table 4 Table showing results for the gene sets used in previous analyses [14, 15]

Fig.Ā 2 shows a QQ plot of the SLPs obtained for the GO gene sets against the expectation if the sets were independent and it can be seen that they conform fairly closely to this distribution, although with some excess of positive SLPs. In fact, the sets are not independent and genes overlap between sets so one could argue that the SLPs are somewhat higher than would be expected by chance but there is certainly no strong evidence to implicate specific sets. TableĀ 5 shows those sets achieving an SLP of ā‰„ā€‰2 in the full analysis including all covariates.

Fig. 2
figure 2

QQ plot for set-wise SLPs obtained when including all covariates for GO gene sets against the expected SLP if all sets were non-overlapping and independent

Table 5 Table showing GO gene sets achieving SLP ā‰„ā€‰2

Discussion

This study demonstrates that it possible to use a logistic regression approach to carry out a weighted burden analysis, including all variants in a gene weighted according to function and rarity, while including other important covariates. The example analysis produces results which are broadly similar to previous analyses of this dataset, although with less strong support for the previously implicated gene sets.

An important finding is that including the population principal components is able to completely control for the effect of having an excess of cases with a strong Finnish ancestry component. This means that the complete dataset can be analysed if desired. Of course, there might be arguments against doing this. If different causal variants are present in different populations then analysing an admixed population might reduce power but in general it seems advantageous to be able to incorporate ancestry as a covariate. From a theoretical point of view we might expect that incorporating known risk factors as covariates could increase the power to detect new associations. In practice, including the PRS and pathogenic CNVs, and sequence variants has only a small effect on the results obtained in the present example.

Although the main emphasis of the study is simply to demonstrate the feasibility of the approach, it is possible to speculate on why it failed to provide any new, interesting results. One problem may well be that the default scheme for weighting variants according to predicted functional effects is sub-optimal. Many criticisms of it could be made, e.g., that LOF variants and 5ā€²-untranslated region variants should be given higher weights. For the future, it would be worthwhile to make attempts to produce a weighting scheme which was informed by the increasing empirical knowledge becoming available about the kinds of genetic variation most likely to have major effects on phenotype. Another approach would be to attempt to fit the weights directly and this could easily be done within the context of logistic regression analysis if it were thought that the dataset was informative enough. A similar approach has been applied in the context of generating an exome-wide genetic risk score [28].

Incorporating pathogenic variants had little effect on the results obtained and one reason for this is that only 1% of subjects were identified as possessing such a variant. It would be reasonable to hope that as further studies discover additional variants to be pathogenic so the proportion of subjects possessing one them will increase and they will have a larger impact on the results of multivariate analyses. However, another factor to consider is that there were almost certainly errors in the CNV calls, and that had the calls been accurate then they might have had more influence on the results. This is simply a feature of using a downloaded VCF file to provide an example dataset. In a real-world association study using whole exome sequence data, one would have the raw reads available and could make more reliable calls from those, and of course one might also carry out comparative genomic hybridization analysis to obtain accurate calls.

Although the method described here does not seem to have marked benefits for this dataset, it does offer an approach that will be useful to apply as additional risk factors, both genetic and non-genetic, are characterized and need to be incorporated into analyses. An obvious example would be for a disease such as ischaemic heart disease (IHD), where rare and common genetic variants along with environmental risk factors all make substantial contributions [29]. The application of logistic regression has been described as a way to incorporate information from covariates into association studies seeking to identify variants and genes associated with disease. However, the same framework could then readily be translated into a way to characterize risk of disease for an individual. One could fit a model incorporating relevant risk factors using a study population and then take the fitted coefficients for that model and apply them to a genotyped individual subject to obtain an estimate of their level of risk. It has already been proposed that the PRS IHD for could have clinical utility in deciding whether to prescribe statins for a patient [29]. Using a model such as the one proposed would allow an overall risk score to be produced from a single combined analysis including factors such as age, gender, smoking status, ancestry, IHD PRS and rare coding variants. One can envisage in future that one might even be able to partition overall risk in clinically significant ways, e.g., with a contribution related to dyslipidemia, a contribution related to clotting and a contribution related to hypertension. This could allow the subclassification of patients to support personalized approaches to treatment interventions.

Software availability

The code and documentation for SCOREASSOC, PATHWAYASSOC and GENEVARASSOC is available from https://github.com/davenomiddlenamecurtis/scoreassoc and https://github.com/davenomiddlenamecurtis/geneVarAssoc.