Enhanced prediction of gene and missense rare-variant pathogenicity by joint analysis of gene burden and amino-acid residue position

Although rare missense variants underlying a number of Mendelian diseases have been noted to cluster in specific regions of proteins, this information may be underutilized when evaluating the pathogenicity of a gene or variant. We introduce ClusterBurden and GAMs, two methods for rapid association testing and predictive modelling, respectively, that combine variant burden and amino-acid residue clustering, in case-control studies. We show that ClusterBurden increases statistical power to identify disease genes driven by missense variants, in simulated and experimental 34-gene panel for hypertrophic cardiomyopathy. We then demonstrate that GAMs can be used to apply the ACMG criteria PM1 and PP3 quantitatively, and resolve a wide range of pathogenicity potential amongst variants of uncertain significance. An R package is available for association testing using ClusterBurden, and a web application (Pathogenicity_by_Position) is available for missense variant risk prediction using GAMs for six sarcomeric genes. In conclusion, the inclusion of amino-acid residue positional information enhances the accuracy of gene and rare variant pathogenicity interpretation. Author Summary Two statistical methods have been developed that utilize signal in the residue position of missense variants. The first is a rapid association method that tests the joint hypothesis of an excess of rare-variants and rare-variant clustering. The method, ClusterBurden, is powerful when rare-missense variants cluster in discrete pathogenic regions of the protein. It can be applied to exome-scans to discover novel Mendelian disease-genes, that may not be identified by classic burden testing. The second method is a statistical model for rare-missense variant interpretation. It provides superior predictive performance compared to generic in silico predictors by training on our large case-control dataset. The method represents a data-driven quantitative approach to apply hotspot and in-silico prediction criteria from the ACMG variant interpretation guidelines.


Introduction
It has been frequently reported that pathogenic missense variants, tend to cluster in specific regions or domains of proteins [1][2][3][4][5][6][7]. A plausible mechanism underpinning this phenomenon is the presence of multiple loss or gain-of-function variants within functionally important domains, which disrupt critical aspects of protein function [8]. Despite numerous examples of variant clustering, there have been few attempts to explicitly model variant residue position as a predictor of pathogenicity [9].
Pathogenic genes for Mendelian diseases were historically identified by linkage and candidate gene studies in multiple affected families [10]. Advances in high-throughput DNA sequencing technology allow scanning of whole-exome or whole-genome sequencing of patient cohorts to offer an alternative strategy to identify novel pathogenic genes and variants. The aggregated burden of variants in affected cases compared to healthy controls has proved to be a useful statistical test to confirm the pathogenicity of candidate genes [11] as well as identify novel putative pathogenic genes [12]. However, for genes where variant pathogenicity is not uniform, including positional information alongside burden may improve power to detect associated genes.
The American College of Medical Genetics and Genomics (ACMG) have produced general guidelines to interpret variant pathogenicity [13]. These guidelines integrate a variety of diverse data, including population frequency, functional and segregation data, and classify variants into five categories from benign to pathogenic. However, due to low counts of observed variants in case datasets, a lack of segregation data and functional evidence, or presence in control datasets, many variants fall into the category 'variant of uncertain significance' (VUS).
Hypertrophic cardiomyopathy (HCM), a relatively common disease (1 in 500 prevalence), is an exemplar for this. It is a major cause of heart disease in people of all ages [14][15] and a cause of sudden cardiac death. Eight sarcomeric genes collectively provide firm molecular diagnoses for ~27% of HCM patients, with a further ~13% of patients carrying VUS in the same genes. It has been suggested that disease and genespecific approaches are needed to improve interpretation [16] and guidelines have been produced for specific genes and/or disease areas [17][18][19][20][21]. Missense variant clustering is a gene-specific metric that falls under the ACMG evidence category PM1('mutational hotspot'). However, there has previously been limited data to define these 'hotspots' quantitatively. Therefore this category is only used subjectively for a limited number of genes.
Here we introduce two new statistical methods to aid in the identification of novel causal genes and reassess the pathogenicity of variants in well-established disease genes. We show that information on a variants' amino-acid position can usefully augment power to detect novel pathogenic genes compared to simpler, mutation burden-based tests. We apply the methods to an extensive dataset of 5,338 HCM cases and use 125,748 gnomAD population controls [22], to visualise the landscape of burden and position signals across 34 cardiomyopathy genes. Finally, we develop and apply a flexible statistical modelling framework that can integrate variant burden with residue annotation data to predict pathogenicity potential in six well-established pathogenic HCM genes.

Methods
A computationally rapid, rare variant association test (ClusterBurden) was developed to test the joint hypothesis of an excess of rare missense variants, clustered with respect to amino-acid residue position in case-control data. This was accomplished by combining p-values from a rare variant burden test with a second binning test that detects variant clustering.

ClusterBurden -A combined rare variant burden and cluster test
A 2 x 2 contingency table was constructed to summarise variant carrier status in cases and controls. The tables, which sometimes include low (fewer than 5) or zero cell counts, are conveniently analysed by Fisher's exact test (FE). While we recognize that, as the number of observed variants is not fixed, a singlyconditioned exact test of odds-ratios (OR) is a more appropriate statistical test for case-control studies [23], FE was pragmatically used for its speed of computation and ease of implementation. As there are no known examples of a protective burden of rare exonic variants in cardiomyopathy, we consider only one-sided significance tests, where the p-value is the proportion of contingency tables with ORs higher than the observed one.
To test for distributional differences, along the length of a protein, the protein's linear sequence of aminoacid residues was split into k bins of equal length. To test the hypothesis that variants cluster in specific bins, in affected cases compared to controls, we applied a chi-squared two-sample test (hereafter BIN-test) to a k x 2 contingency table of binned variant counts in cases and controls. The BIN-test statistic is defined as: where R is the frequency in bin Ki for the cases and S is the frequency in bin Ki for controls. The test statistic B is asymptotically chi-squared distributed with k degrees of freedom. We used the k ~ n 2/5 heuristic [24] to select the number of bins (k) dependent on n, the number of observations. We compared the performance of the BIN-test with Anderson-Darling (AD) [25] and Kolmogorov-Smirnov (KS) [26] twosample distributional tests.
Fisher's method [27] was used to calculate the joint significance of the contributing burden and cluster tests by summing the natural logarithms of the two p-values, multiplied by minus two, to produce a chisquared statistic with 4 degrees of freedom. An important assumption of this method is that the two pvalues are uncorrelated; this was assessed in simulated data using Spearman's rank correlation test [28].

A combined rare variant burden and annotation prediction model
We fitted gene-specific generalized additive models (GAM) [29] implemented in the R package "mgcv" [30], to model rare missense variants in firmly-established pathogenic genes. The model was trained on disease status in our HCM-gnomAD case-control dataset and is unsupervised with respect to variant pathogenicity. The model is therefore not reliant on previous classifications and includes all rare-variants in both cases and controls to make estimates of odds-ratios. This allows us to quantify pathogenicity and uncertainty for variants, taking background variation and incomplete penetrance into account.
GAMs adaptively model linear and non-linear relationships of varying complexity, between explanatory variables (e.g. burden, residue position) and the response variable (case-control status). Non-linear terms specified as smooth functions are built from underlying basis functions whose linear combination sum to smooth the predictor fully. Automatic optimisation by penalized maximum likelihood reduces over-fitting as increased 'wiggliness' comes at a user-specified cost. GAMs therefore facilitate a flexible and parsimonious non-linear model-selection strategy to integrate rare variant burden and amino-acid annotation data.
The primary predictive features were carrier status (to model rare-variant gene-level burden) and residue position (to model clustering). Secondary predictive features included several variant prediction scores (e.g. SIFT [31]) extracted from the dbNSFP4.0 database [32]. To include the gene 'burden' in the model, non-carriers (i.e. samples without a variant in the gene) were included in the model. However, variant-level features such as amino-acid position are undefined (i.e. meaningless) for non-carriers, so a nested hierarchical model structure is required, whereby features are included in the model only as an interaction with carrier status. Carrier status as a binary indicator variable is then multiplied with the model matrix for smoothed terms, such as amino-acid position. For non-carriers, the indicator variable is zero to exclude this undefined data from the analysis.
The structure of an example GAM with three independent predictors, a [0,1] indicator variable for carrier status (car), a continuous variable for amino-acid position (aapos) and the continuous variable for SIFT score (sift_score) is as follows: Ln(P/1-P) = β0 + β1 car + s1 (aapos, by = car) + s2 (sift_score, by=car) + ε where Ln(P/1-P) is a logistic function specifying the probability of being a case (P), β0 the model intercept, β1 a linear coefficient for car, s1 a smoothed (i.e. non-linear) function for aapos, s2 a smoothed function for sift_score, by is used to generate factor smooth interactions and ε is a binomial random residual error term. GAMs were fitted using thin-plate regressions and parameters were estimated by restricted maximum likelihood (REML). A strict two-stage feature selection procedure was implemented to avoid overfitting. In stage 1, only features with a marginal p-value < 0.002 (0.05/24 for Bonferroni correction) were selected. In stage 2, backwards elimination was implemented using p-values Bonferroni corrected for the number of features selected in stage 1.
Six genes (MYBPC3, MYH7, MYL2, MYL3, TNNI3 and TNNT2) each carrying at least 20 rare missense variants in our HCM cases and with a significant positional signal, were informative for GAM analysis. Carrier status (gene burden) and amino-acid residue position (local burden) were included as primary predictive features (posGAM). A two-stage selection of secondary features (e.g. SIFT scores) was then performed to model the relationship between HCM status with the primary features and any retained secondary features (fullGAM). ORs with standard errors were computed for each variant to predict their respective strengths of association. Two other sarcomeric genes that showed evidence of excessive burden, ACTC1 and TPM1, but had insufficient evidence of clustering, so positional GAM modelling was less informative for these genes.
Our GAM models are capable of integrating pathogenicity data underpinning two ACMG criteria; PM1 (mutational hotspot) and PP3 (in silico prediction algorithms). There is currently no quantitative way to activate PM1 and PP3. For criteria PP1 and PS4, [17] propose OR thresholds for MYH7 to quantify evidence of pathogenicity as 10-30 for supporting, 30-100 for moderate and >100 for strong. Adopting the same thresholds, GAMs can compute variant-specific ORs that can quantitatively apply the PM1 and PP3 criteria, as supporting, moderate or strong evidence.
The relative performance of alternative GAM models was assessed by receiver operator characteristic (ROC) curves across ten cross-fold validations using 80:20 splits. Model predictions were stratified by manual variant classifications, obtained from an in-house database to determine correlation.

Simulated and observed data
A computationally rapid forward-time rare variant simulator was coded in R to model missense variants in proteins, clustered in pathogenic regions under high selection (S1 Methods). Briefly, demography was based on European population history, mutations followed an infinite sites model and mating was dependent on selection. Three clustering scenarios were considered: 1) a uniform pathogenicity model 2) a single cluster model equivalent to a protein with one discrete pathogenic region and 3) a multiple cluster model where the protein contains several pathogenic regions. Simulation parameters such as mutation rate were tuned to generate a simulated case-control cohort with properties comparable to weakly associated genes in our HCM-gnomAD dataset. Simulated ORs were kept intentionally low for two reasons; to ensure power was <95% for each test to facilitate comparisons of power, and to highlight that the power to detect associations for relatively low penetrance genes requires highly efficient methods.
Synthetic data were generated for 5,000 cases and 125,000 controls and variants were filtered at a minor allele frequency of <0.0001. The type 1 error and power of ClusterBurden was compared to two published position-informed association methods: DoEstRare [9] and CLUSTER [33], and three position-uninformed methods: C-alpha [34], SKAT [35], WST [36], using 10,000 replicate datasets, under the three different clustering scenarios and two different protein lengths (500 and 1,000). Power and type 1 error were calculated using the (r+1)/(n+1) estimator where r represents the number of simulated datasets with pvalues less than the 0.05 and n is the number of simulations [37]. Implementation details for all tests considered are found in S1 Table. Next-generation sequence (NGS) data for 34 cardiomyopathy genes were available for two large HCM cohorts; 2,757 probands referred to the Oxford Medical Genetics Laboratory (OMGL) for genetic testing and 2,636 HCM probands recruited to the HCMR project [38]. High coverage exonic sequences were captured by target enrichment and sequenced on the MiSeq platform (Illumina Inc.). Bioinformatic processing of NGS data followed the Genome Analysis ToolKit version 4 best practice guidelines (https://software.broadinstitute.org/gatk/best-practices/). OMGL variants were confirmed by Sanger sequencing, HCMR variants were manually checked by inspection of BAM files.
In all analyses, only missense variants were considered, defined as single nucleotide polymorphisms that cause a single amino acid residue substitution in the protein sequence. The gnomAD population reference database was used as a control group, which includes variant frequency data based on up to 125,748 individuals. Individual sample-level information is unavailable for gnomAD, so we assumed that no sample carried more than one variant in each gene. Although individual sample-level information was available for the HCM cases, the same simplification was applied. This was justified by the low proportion of case samples with multiple variants across the gene. For 5,338 samples in the combined OMGL-HCMR dataset, over 34 genes, there were an average of 0.77 samples with multiple missense variants post-filtering per gene. We define rare variants by reference to their allele frequencies in unaffected controls, which for rare Mendelian diseases is approximated by population reference databases such as gnomAD [11][12]. Rare variants were selected with a gnomAD population maximum (popmax) allele frequency less than 0.0001 to exclude potentially common, and thus unlikely to be pathogenic for HCM, ancestry-specific polymorphic variants.

ClusterBurden -a clustered rare variant burden test
Correlation between p-values from the BIN-test and FE tests were compared under the null and disease (cluster/burden) models in simulated data. For the disease model, there was an expected positive correlation (Spearman's rank correlation rho=0.398) between the p-values, as the power of both tests covary with the number of observed data points (i.e. number of rare variant carriers). However, under the null model, the p-values were completely uncorrelated, satisfying the independence assumption of the Fisher p-value combination method.   ClusterBurden was the most powerful method when clustering was present, whereas CLUSTER was most powerful under the uniform (i.e. burden-only) association model. Amongst the position-informed tests, ClusterBurden was the most rapid to compute per gene (<1 second) whereas DoEstRare took >20 minutes and CLUSTER took >4 minutes.

ClusterBurden analysis of cardiomyopathy gene panel
We examined 34 cardiomyopathy genes for rare missense variant associations with the FE (burden), BINtest (cluster) and ClusterBurden (combined cluster/burden) tests in HCM cases and gnomAD controls (Fig.  1). Significance thresholds were Bonferroni adjusted to allow for 34 genes x 3 methods (i.e. p-values adjusted for 102 tests to p < 0.00049). Significant burden signals were detected in 11 genes with FE; MYH7 and MYL3 (p < 1.7 x 10 -4 ). Two additional core sarcomeric genes did not show Bonferroni significant associations; ACTC1 (p<0.0412) and TPM1 (p< 0.0494). ClusterBurden confirmed the association for 12 genes that showed burden signals and calculated substantially lower p-values for all eight coresarcomeric genes, consistent with enhanced power for this approach.  Table  2. Due to the covariance of power to detect an association and the number of observations, genes with more variants supported the use of more features.

MYL3
Residue position 3.1x10 -4 The receiver operator characteristic (ROC) area under the curve (AUC) metric was calculated for ten crossfold validations with 80:20 splits. The two GAM models, fullGAM and posGAM, were compared to models that stratified samples into cases or controls based on individual in silico variant prediction scores. Linear thresholds for the individual score models were determined to give the maximum possible AUC for each score, conditional on the observed data. Figure 3 displays the mean and standard deviation AUC across the ten cross-fold splits for the five highest-scoring models in each gene. For all six HCM genes, the fullGAM had a higher mean AUC than any individual in silico predictor. With the exception of MYBPC3, the posGAM performed better than any in silico predictor from dbNSFP. The AUC standard deviations for MYH7 and MYBPC3 were considerably lower than the remaining genes.
There was a strong correspondence between fullGAM predictions and expert classifications made by OMGL for variants in each gene (Fig. 4). In MYH7 cases, mean predicted ORs for pathogenic, likely pathogenic and VUS variants were 74, 50 and 20 respectively. Notably, the VUS class had high heterogeneity, with predicted ORs ranging from 0.25 to 197. For 53% of these VUS's, limited information is available, as they are observed in a single case and were not present in gnomAD. Based on these observed frequencies, and after Haldane continuity correction [39], the empirical OR for each of these variants is 44

Discussion
We have developed two new analytic methods; ClusterBurden and GAMs, which incorporate information on amino-acid residue position to examine the pathogenic potential of rare coding variants in Mendelian disease genes. We apply the methods to gene panel data from HCM patients to illustrate the applications of this approach for rare variant interpretation, and for pathogenic gene discovery.

ClusterBurden
ClusterBurden is a gene association test with superior power over a standalone burden test in situations where rare pathogenic variants cluster in specific protein regions. ClusterBurden was devised to be suitable for scanning large-scale whole-exome sequencing projects designed to identify novel pathogenic genes for rare Mendelian diseases. The combination of FE and BIN-test to model clustering and burden minimizes the computation overhead required to calculate p-values making analyses of >20,000 genes [40] in hundreds of thousands of cases and controls, practical in terms of execution time and computer memory requirements. For genes with a significant BIN-test 'cluster' p-value, ClusterBurden calculated considerably lower p-values that the traditional Fisher's exact 'burden' test, implying an increase in statistical power (Table 1). Although ClusterBurden has reduced power whenever clustering is absent, we observe clustering for the majority of well-established HCM genes where missense variants cause disease, so expect that the method will often be more powerful to detect novel Mendelian genes than a burden-only test.

GAMs
We show that generalized additive models can be informative to assess pathogenicity of rare coding variants based on our study of several well-established HCM genes. We report strikingly different predicted ORs depending on where in the linear protein sequence a variant falls. For 5 out of 6 core sarcomeric genes; MYH7, TNNT2, TNNI3, MYL2 and MYL3, variant residue position relative to gnomAD population controls was the best predictor of case or control status. For MYBPC3, several bioinformatics annotation features in dbNSFP had improved predictive performance over residue position.
GAMs have attractive statistical properties that are not necessarily shared by other machine-learning approaches, in that they produce familiar interpretable results via variant-specific ORs and quantify uncertainty in estimates by 95% confidence intervals. Unlike empirical ORs that are based on the observed case-control frequencies of the given variant in isolation, GAM ORs draw upon a much larger pool of information, including in silico prediction scores and features of other variants in close proximity. These estimates implicitly combine information on both pathogenicity and penetrance, have much tighter confidence intervals (even when observed counts are very low), and could be calculated for novel singleton or hypothetical variants.
Model predictions were positively correlated with classifications made by experts at OMGL. Discordant predictions highlight variants with potential for reclassification. GAM predictions therefore offer a useful metric for stratification of rare missense variants and suggest a natural ranking to variants that can be considered for reclassification as (likely) pathogenic or (likely) benign. The GAMs developed here overlap with two ACMG criteria; PM1 and PP3. The posGAM model represents a quantitative data-driven approach to applying criteria PM1. Under the ACMG guidelines, classifications are made when a specific number of criteria have been met at different levels (e.g. one strong and two moderate for a pathogenic classification). Therefore there is potential to lose information when multiple rules are captured in a single model i.e. fullGAM. However, in most cases, the addition of supporting evidence from PP3 will not impact classification. It is therefore possible in these circumstances, to combine signals PM1 and PP3 in the fullGAM to use available information as efficiently as possible.
For genes burdened by clustered variants, GAM improved performance over other commonly used in silico predictors that are not optimized on gene-specific data. Unlike most variant prediction algorithms, including all those available in dbNSFP, where models are trained using variant pathogenicity as the response label, the GAM response variable is case status and not variant pathogenicity. This is necessary to ascribe ORs that simultaneously represent pathogenicity potential and penetrance. However, this does impose a limit on the maximum predictive accuracy for each model, as this is heavily influenced by incomplete penetrance. However, the relative increase in AUC when using this gene-specific approach, compared to generic in silico predictors was substantial.

Significant position signals in the HCM gene panel
After popmax filtering at 0.0001, we detected Bonferroni significant clustering of variants in six genes in HCM cases compared to the gnomAD reference controls; MYH7, MYBPC3, TNNT2, TNNI3, MYL2 and MYL3 (Figure 1). The strongest position signal was observed in the beta myosin heavy chain protein (MYH7: ENST00000355349), a finding that has been long recognised [41][42]. The highest variant density is observed in residues 100-900 that overlaps with the myosin-head motor domain, two peak densities in this region centre on residues 370 and 830 ( Figure 2). The relatively low variant density in cases and high density in controls in the carboxy-terminus of this protein might lead an observer to hypothesise a regional protective effect on HCM risk (S13 Figure). In sharp contrast, the GAM model predicts a modestly excessive burden (OR ~3) across this entire region discounting the likelihood of a localised protective effect ( Figure  2).
A strong position signal, driven by four potential clusters, was observed in the MYBPC3 gene (ENST00000545968), which encodes cardiac myosin-binding protein C C10 (Figure 2). Clusters peaked at residues 260, 518, 864 and 1274, which respectively fall in domains C1, C3, C7 and C10. Multiple functional roles are suspected for the region containing the C1 domain, including binding to myosin S2 and actin. The C10 domain is also a possible titin binding site [43]. To explore whether the signal was overly driven by founder mutations found at high frequencies in our cardiomyopathy cases, seven variants with allele counts above 10; p.Arg810His, p.Asp770Asn, p.Glu542Gln, p.Arg502Trp, p.Arg495Gln, p.Glu258Lys and p.Val219Leu were masked in a sensitivity analysis. In their absence, there is still strong evidence of a position signal (p < 3 x 10 -9 ) and the remaining peak densities overlap with the locations of the (masked) founder mutations (S14 Figure).

Conclusion
As the GAM modelling framework is a data-dependent approach to pathogenicity interpretation, increasing the size of the training dataset should increase the accuracy and confidence of model predictions for HCM. This is especially relevant if this model is to be integrated within the ACMG classification pipeline. With large datasets to drive these models, criteria PM1 and PP3 can be applied to confidently to variants observed in the clinic. Furthermore, this modelling approach has general application to other Mendelian diseases with sufficiently large case cohorts for a data-driven modelling approach.
Our present analytic methods assume an autosomal dominant genetic model. With sample-level information to distinguish homozygotes and compound heterozygotes, it is conceivable to extend ClusterBurden and the GAM methods to analyse a recessively inherited disease by judicious choices of indicator-variable coding. The GAM approach could also be extended from the 1-dimensional linear protein sequence to 3-dimensional protein structures, by including smoothed linear variables to model x, y and z protein coordinates. This is potentially a more informative way to model variant clustering; however it is limited by the availability of complete high-resolution 3-dimensional structures. For the HCM genes we examined, no suitably complete structures were available; as more structures are solved and possibly for other Mendelian diseases, a 3-D GAM analysis might offer further improvements in variant interpretation.
In conclusion, with the assembly of large patient and population control datasets to quantify mutation clustering, missense residue position is an important feature to consider in analyses of rare-variants in Mendelian diseases. The ClusterBurden and GAM methods have the potential to improve power to detect novel pathogenic genes and probe in detail the genetic architecture of risk variants, analyses that could improve interpretation of genetic testing to provide more reliable information to families and patients.    Classifications for each variant were manually curated by the Oxford Medical Genetics Laboratory. Risk predictions generated by fullGAM models are displayed on the x-axis on a log10 scale. Only variants with a popmax frequency of less than 0.01% were considered, excluding most variants with a benign or likely benign classification.
Supporting information captions S1 Table: Implementation details for all statistical tests used for type 1 error and power calculations in this study. S1 Methods: Description of the forward-time simulation algorithm used to simulate rare-clustered variants. Each plot shows predictions (odds-ratios) generated by GAM, trained on cardiomyopathy cases (n=5,338) and gnomAD controls (n=125,748). Each point is a variant in our hypertrophic cardiomyopathy dataset, is coloured by its expert classification (made by the Oxford Medical Genetics Laboratory), and is accompanied by a 95% confidence interval bar. The dashed red line indicates an odds-ratio of 1, the blue dashed line indicates the odds-ratio for uniform burden, and the solid black line is the marginal odds-ratio for a model of only amino-acid residue number. S13 Figure: Distribution and risk predictions for rare-missense MYH7 variants in our case-control cohort.

S1-6 Figures
The variant positions and training data for the GAM are a case cohort of 5,338 hypertrophic cardiomyopathy cases and 125,748 gnomAD controls. The density plot (lower panel) may give the impression that there is an excess of control variants in the C-terminus of the MYH7 protein; however the GAM model (upper panel) resolves this potential misinterpretation and clearly shows an odds-ratio greater than 1 for the entire protein.
S14 Figure: Rare-missense variant clustering in MYBPC3 with and without potential founder mutations.
Variant clustering model (posGAM) are generated for three different frequency filtering strategies. The model identifies four discrete regions with high pathogenic potential regardless of whether the founder mutations are included in the analysis. However, the magnitude of the predicted ORs are higher under normal filtering conditions (e.g. popmax < 0.01%).