Assessing performance of pathogenicity predictors using clinically relevant variant datasets

Background Pathogenicity predictors are integral to genomic variant interpretation but, despite their widespread usage, an independent validation of performance using a clinically relevant dataset has not been undertaken. Methods We derive two validation datasets: an ‘open’ dataset containing variants extracted from publicly available databases, similar to those commonly applied in previous benchmarking exercises, and a ‘clinically representative’ dataset containing variants identified through research/diagnostic exome and panel sequencing. Using these datasets, we evaluate the performance of three recent meta-predictors, REVEL, GAVIN and ClinPred, and compare their performance against two commonly used in silico tools, SIFT and PolyPhen-2. Results Although the newer meta-predictors outperform the older tools, the performance of all pathogenicity predictors is substantially lower in the clinically representative dataset. Using our clinically relevant dataset, REVEL performed best with an area under the receiver operating characteristic curve of 0.82. Using a concordance-based approach based on a consensus of multiple tools reduces the performance due to both discordance between tools and false concordance where tools make common misclassification. Analysis of tool feature usage may give an insight into the tool performance and misclassification. Conclusion Our results support the adoption of meta-predictors over traditional in silico tools, but do not support a consensus-based approach as in current practice.


INTRODUCTION
As the scale of genomic sequencing continues to increase, the classification of rare genomic variants is the primary bottleneck in the diagnosis of rare monogenic disorder. Guidelines published by the American College of Medical Genetics (ACMG) in 2015 1 have helped to bring greater consistency to variant classification. These have been followed by gene/disorder-specific rule-sets, 2 3 and country/ healthcare system-specific guidance such as the UK Association for Clinical Genomic Science (ACGS) best practice guidelines for variant interpretation. 4 Common to all guidelines is the recommendation of the use of in silico prediction tools to aid in the classification of missense variants. In silico prediction tools are algorithms designed to predict the functional impact of variation, usually missense changes caused by single-nucleotide variants (SNVs). Though originally designed for the prioritisation of research variants, 2 the tools are used routinely in clinical diagnostics during variant classification. The tools integrate a number of features in order to assess the impact of a variant on protein function. 3 Initially, inter-species conservation formed the bulk of the predictions, with some additional functional information, such as substitution matrices of physicochemical distances of amino acids (such as Grantham 5 or PAM 6 ), and data derived from a limited number of available X-ray crystallographic structures. 7 Since the development of the first in silico prediction tools over a decade ago, 2 7 large-scale experiments such as the ENCODE project 8 have generated huge amounts of functional data, and we now also have access to large-scale databases of clinical and neutral variation. [9][10][11] These additional sources of data have led to an explosion of new in silico prediction algorithms 12-14 that purport to increase accuracy.
However, the large increase in the number of predictors integrated into classification algorithms has raised concerns about overfitting. 15 16 Overfitting occurs when the prediction algorithm is trained on superfluous data or features that are irrelevant to the prediction outcome. 16 While it may appear that an increasingly large feature list leads to improvements in prediction, random variability within the training dataset may result in decreased accuracy when applied to a novel dataset. Overfitting can be mitigated through the use of increasingly large training datasets, and the usage of online variant databases, such as the genome aggregation database (gnomAD) 17 and ClinVar, 10 allows for sufficiently large training datasets. In addition, reliance on additional information-such as protein functional data and allele frequency data such as from gnomAD 17 -may be contrary to the standard assumptions of variant classification methodology, namely that each dataset is independent and applied only once during classification.
The 2015 ACMG guidelines recommend the use of a concordance-based approach, where several prediction algorithms are used, and evidence is applied only when there is agreement between tools. There is no guidance on which in silico tools should be used, how many or on what constitutes a consensus, and this ambiguity allows for inconsistencies in the application of this piece of evidence across clinical laboratories. Studies have previously Diagnostics identified the limitations of applying a strict binary consensusbased approach. 18 In response, multiple groups [12][13][14] have created meta-predictors; tools which integrate information from a large number of sources into a machine-learning algorithm. These tools thereby adhere to the principle of the consensus-based model suggested by ACMG without the onerous task of determining tool concordance and reduce discordance when increasingly large numbers of tools are used. Unlike a manual consensusbased model, where tools are weighted equally, meta-predictors can apply weighting to features in order to maximise accuracy. The UK ACGS guidelines 4 suggest it is likely that a single metapredictor will replace this concordance-based approach, but a comprehensive analysis using a clinically representative dataset has not yet been done.
In order to evaluate the accuracy of in silico prediction tools, precompiled variant datasets such as VariBench 19 have been designed to aid in training and benchmarking of pathogenicity predictors. However, the use of standardised datasets may introduce inherent biases into prediction algorithms, resulting in overfitting and false concordance. Typically, prediction software is trained using machine-learning algorithms, and assessed using variants available from large online public databases 2 3 7 8 12-14 20 such as ExAC/gnomAD, ClinVar 10 and SwissProt. 21 It has been previously shown that prediction algorithms have variable performance when applied to different datasets, 3 20 22 23 and therefore the use of variant datasets derived from online public databases may not be representative of the performance of tools when applied in a clinical setting. While studies emphasise the use of 'neutral' variation, the output from a modern next-generation sequencing pipeline is generally far from neutral and includes a large number of variant filtering steps in order to reduce the burden of manual variant assessment. 24 Here, we evaluate and compare the performance of two traditional in silico pathogenicity prediction tools commonly used for clinical variant interpretation (SIFT 2 and PolyPhen-2 7 ), and three meta-predictors (REVEL, 12 GAVIN 13 and ClinPred 14 ) using a publicly available ('open') variant dataset and a clinically relevant ('clinical') variant dataset (see figure 1). While a number of other tools are available, these metapredictors were selected as they were designed in the anticipation of being used in a clinical setting. We show that the tools' performance is heavily affected by the test dataset, and that all tools may perform worse than expected when classifying novel missense variants. By assessing the effect of a consensus-based approach, our results support the use of a single classifier when performing variant classification.

MATERIALS AND METHODS Open dataset
Open dataset (n=8480, see figure 1A) represents the typical training and validation dataset used during in silico predictor design and benchmarking. Positive ('pathogenic') variants were downloaded from ClinVar 10 on 13 November 2017 and subscription-based HGMD 25 Professional release 2017.3; neutral ('benign') variants in OMIM 26 morbid genes were downloaded from the gnomAD 9 database (exomes only data v2.0.1). ClinVar criteria: Stringent criteria were used to increase the likelihood of selected variants being truly pathogenic. Missense SNVs with either 'pathogenic' and/or 'likely pathogenic' classification, multiple submitters and no conflicting submissions were included; variants with any assertions of 'uncertain', 'likely benign' or 'benign' were excluded. HGMD Pro criteria: Single-nucleotide missense variants marked as disease-causing ('DM') were taken from HGMD Professional release 2017.3. gnomAD criteria: Missense SNVs with an overall minor allele frequency (MAF) between 1% and 5% were selected. These variants were deemed too common to be disease causing but are not necessarily filtered out by next-generation sequencing pipelines depending on the MAF thresholds used. Chromosomal locations with more than one variant (multiallelic sites) were excluded. Any variants found to be present in the 'pathogenic' and 'neutral' datasets were removed from both. Variants present

Clinical dataset
Clinical dataset (n=1757, see figure 1B and online supplementary table S1) more accurately reflects variants that might require classification in a clinical diagnostics laboratory following identification in an exome or genome sequencing pipeline. Variants were selected from three sources. Group 1 ('DDD') consists of pathogenic (n=687) and benign (n=533) missense variants identified from 13 462 families in the Deciphering Developmental Disorders (DDD) study that have been through multiple rounds of variant filtering and clinical evaluation. 24 27 Variants were identified through exome sequencing and were reported to the patients' referring clinicians for interpretation and confirmation in accredited UK diagnostic laboratories. All benign variants from this list were assessed as having no contribution towards the patient's phenotype, and were present in either as heterozygotes in monoallelic genes or homozygotes in biallelic genes classified according to the Developmental Disorder Genotype-2-Phenotype database (DDG2P) 28 (data accessed 17 Oct 2019). Group 2 ('Diagnostic') consisted of pathogenic (n=452) and benign (n=28) missense variants identified through Sanger sequencing, next-generation sequencing panel analysis or single gene testing in an accredited clinical diagnostic laboratory. Variants were manually classified according to the ACMG guidelines on variant interpretation 1 on a 5-point scale (data accessed 23 Apr 2019). Group 3 ('Amish') consisted of benign missense variants (n=57) identified through a Community Genomics research study of 220 Amish individuals. Variants were identified through singleton exome sequencing and were classified as benign based on population frequencies and zygosity within this study. Two subgroups were manually selected and annotated based on inheritance pattern and disease penetrance; subgroup (i) consisted of variants in genes that cause a dominantly inherited disorder with complete penetrance in childhood, for which the individual was clinically unaffected; this list was curated by a consultant in clinical genetics; subgroup (ii) consisted of variants in all other OMIM morbid genes (including those with incompletely penetrant dominant disorders and recessive and X-linked inheritance), with MAF >5% in the Amish cohort and MAF ≤0.01% in gnomAD (data accessed 18 Oct 2019). Variants with missing or intermediate scores were removed.

Transcript selection and variant annotation
For the open dataset, the canonical transcript was selected for each variant using the Variant Effect Predictor (VEP). 29 For the clinical dataset, the HGMD Professional RefSeq transcript was used, unless absent from the database, in which case the MANE primary transcript was selected. Variants were annotated with variant cDNA and protein nomenclature in reference to the selected transcript. PolyPhen-2 and SIFT scores were annotated using VEP. REVEL and ClinPred scores were annotated using flat files containing precomputed scores for all possible singlenucleotide substitutions, and in both cases, the combination of nucleotide position, nucleotide change and amino acid change was sufficiently unique to identify a single record, that is, transcript selection did not affect the scores. GAVIN scores were generated through a batch submission to the GAVIN server.

Tool benchmarking
The performance of each of the tools was determined for both datasets. For SIFT, PolyPhen-2, REVEL and ClinPred, the output of the analysis was a numerical score between 0 and 1. Initially, all tools were analysed according to the criteria defined in their original publications, with the thresholds for pathogenicity being ≤0.05 for SIFT, ≥0.9 for PolyPhen-2 and ≥0.5 for ClinPred. For REVEL, where no threshold is recommended, a threshold of ≥0.5
was used. The categorical classification of GAVIN was used directly ('Benign', 'Pathogenic'). A supplementary analysis was done for those tools with a numerical output (SIFT, PolyPhen-2, REVEL and ClinPred) to more accurately compare their performance. A unique threshold was selected for each tool to calculate the specificity when sensitivity was set to 0.9. In order to include GAVIN in this analysis, a third analysis was performed, whereby each tool's specificity was measured when the threshold was adjusted to set the sensitivity identical to that of GAVIN.

Classification of variant sources
We compared the feature list of all tools benchmarked in this study (PolyPhen-2, SIFT, REVEL, GAVIN and ClinPred) and, in the case of the meta-predictors, the tools that they use as part of their algorithm (MPC, 30 MutPred, 31 VEST, 32 CADD, 33 DANN, 34 SNPEff, 35 FATHMM, 36 FitCons 37 and MutationTaster 38 ). Features were split into five broad categories: Conservation, Genetic variation, Functional evidence (nucleotide), Functional evidence (protein) and Amino acid properties (see figure 2 and online supplementary figure S1). In general, the meta-predictors employ a wider variety of sources and are less heavily reliant on conservation alone. CADD/DANN and FitCons, and by extension GAVIN and Clin-Pred, are the only predictors with features within the Functional (nucleotide) category and are therefore able to predict the pathogenicity of a variant in the context of its nucleotide change, regardless of whether there is a resultant amino acid change.

Benchmarking predictor performance in the open and clinical datasets
Initially, each of the tools was benchmarked according to the threshold provided by the tools' authors. This analysis involved a dichotomisation of scores with no intermediate range (see table 1). The distribution of scores from SIFT, PolyPhen-2, REVEL and ClinPred is shown in figure 3 and receiver operating characteristic (ROC) curves are shown in figure 4. Of the tools with numerical outputs, ClinPred has the highest discriminatory power for the open dataset with an area under the ROC curve (AUC) of 0.993, while REVEL has the highest AUC for the clinical dataset of 0.818. The two meta-predictors outperformed SIFT and PolyPhen-2 in both datasets. In agreement with tool author benchmarking, 12-14 the meta-predictors REVEL, Clin-Pred and GAVIN were highly proficient at classifying the variants in the open dataset, achieving sensitivities of 0.87, 0.90 and 0.95, and specificities of 0.95, 1.00 and 0.98, respectively. For variants in the clinical dataset, although the sensitivity of each tool remained largely constant, the specificity of all tools dropped considerably. For REVEL, ClinPred and GAVIN, specificity is reduced to 0.60, 0.27 and 0.25, respectively (table 1).
It was apparent that the threshold suggested by the tools' authors was not well suited to both datasets, given the tools' high sensitivity but low specificity in the clinical dataset. In order to correct for this, we performed a supplementary analysis for those predictors which gave a numerical output (SIFT, PolyPhen-2, REVEL and ClinPred). Here, a variable threshold

Diagnostics
was allowed for each tool to give a common sensitivity of 0.9 (ie, pathogenic variation is called correctly 90% of the time). The threshold required to give a sensitivity of 0.9 in each tool is shown in online supplementary table S2. The specificity of each tool at the determined threshold is shown in online supplementary figure S2. When allowed a variable threshold, the tools' specificity increased significantly, with PolyPhen-2, SIFT, REVEL and ClinPred achieving a specificity of 0.67, 0.63, 0.93 and 0.99 for the open dataset, and 0.34, 0.33, 0.52 and 0.51 for the clinical dataset, respectively. In order to include GAVIN in this analysis, a third analysis was performed in which each tool was given a threshold to match the sensitivity achieved by GAVIN in each of the datasets. The specificity of all five tools is shown in online supplementary figure S3, and the sensitivity and threshold for each tool is shown in online supplementary table S3.

Use of individual tools versus a consensus-based approach between multiple tools
In accordance with current variant classification guidelines, we investigated the effect of performing a consensus-based analysis, using two commonly used tools, SIFT and PolyPhen-2, and two meta-predictors, REVEL and ClinPred, to determine whether this combined approach has improved sensitivity/specificity over the individual tools. Figure 5 shows the true concordance rate (correct classification by all tools), false concordance rate (incorrect classification by all tools) and discordance rate (disagreement between tools) for each of these tool pairings for the pathogenic and benign variants in both datasets. Within the clinically relevant dataset, the tools are either falsely concordant or discordant for ~15% of pathogenic variants but ~78% of benign variants. The sensitivity and specificity of this approach is shown in table 1. Use of a consensus-based approach may introduce a third 'discordance' category to the classification where tools disagree and no in silico evidence can be used, which applied to 21% and 16% of variants when considering the concordance of PolyPhen-2 and SIFT, and 8% and 23% when considering the concordance between REVEL and ClinPred, for the open and clinical datasets, respectively.
An alternative 'majority rule' method can instead be applied. Here, more than three tools are used, and the result agreed by >50% of tools selected. This method eliminates the 'discordance' category, as dissenting tools are ignored. Two majority-based

Figure 4
Receiver operating characteristic (ROC) curves for SIFT, PolyPhen-2, REVEL and ClinPred using two datasets. Open dataset-blue; clinical dataset-red. Generated in R using the 'roc' and 'plot.roc' functions in the 'pROC' library. Area under the ROC curve (AUC) was calculated in R using the 'roc' function. For ease of comparison, SIFT scores have been inverted.
analyses were performed using (1) all five tools (SIFT, Polyphen-2, REVEL, GAVIN and ClinPred) and (2) using only the meta-predictors (REVEL, GAVIN and ClinPred). The sensitivity and specificity of this majority-based approach is shown in online supplementary table S4. While this approach did improve on the strict concordance approach outlined previously, and is commonly applied in clinical genomics, the false concordance was still high and the highest specificity in the clinical dataset, achieved using a majority voting based approach with all five tools, was 0.32 (in contrast to the specificity of 0.60 achieved by REVEL in the same dataset).

DISCUSSION
We have compared the performance of five in silico pathogenicity predictors-two tools used routinely in variant classification (SIFT and PolyPhen-2) and three recently developed clinical meta-predictors (REVEL, ClinPred and GAVIN)-using two variant datasets: an open dataset collated using the selection strategy commonly employed when benchmarking tool performance, and a clinically representative dataset composed of rare and novel variants identified through high-throughput research and clinical sequencing with manual classification. Overall, the data herein show that meta-predictors have a greater sensitivity and specificity than the classic tools in both variant datasets. However, despite the increased accuracy of the meta-predictors, all tools performed substantially worse in the clinical dataset compared with the open dataset. This difference in tool performance illustrates the importance of considering the provenance of variants when benchmarking tools and how overfitting of a classifier to the training dataset can occur when increasingly large sets of variant features are used. The two datasets herein were constructed using very different methodologies, which determine the variants present within each. The open dataset, composed of variants derived from online repositories, is modelled on the methods commonly used when constructing test datasets. The tools performed universally well when characterising this dataset, indicating that these variants inherently possess features easily identifiable to the in silico predictors. In contrast, the clinical dataset is composed of variants identified through research and clinical next-generation sequencing pipelines, which had undergone multiple rounds of variant filtering and selection. Many variants within the open dataset would be Figure 5 Concordance between tools separated by dataset and classification (pathogenic and benign). Open dataset-blue; clinical dataset-red; pathogenic variants-top graph; benign variants-bottom graph. True concordance indicates that the tools agree and were correct. False concordance indicates that the tools agree but were incorrect. Discordance indicates that the tools disagreed on the classification.

Diagnostics
automatically filtered out of the clinical dataset, based on MAF for example, and this dataset therefore gives a more representative assessment of the performance of such tools in genomic diagnostics laboratories-an assessment not previously performed.
Our analysis suggests that REVEL performs best when classifying rare variants routinely identified in clinical sequencing pipelines, with an AUC for our clinical dataset of 0.818, followed closely by ClinPred with an AUC of 0.808 (figure 4) and with a higher specificity than GAVIN in a direct (although suboptimal) comparison (online supplementary figure S3). While the REVEL team does not suggest a strict threshold for categorisation, in our analysis for the clinical dataset, a threshold of 0.43 gave a sensitivity of 0.9, and a specificity of 0.52, which is comparable with previous studies' threshold of 0.5. 14 Current guidelines on the classification of variants indicate that evidence should only apply when multiple tools are concordant. 1 However, the use of concordance may introduce a third category to variants classification (discordance), where there is disagreement between tools and therefore the tools cannot be used as evidence to categorise the variant as either benign or pathogenic. The use of a majority-voting system appears to improve performance over a strict concordance approach, but our data show that both concordance methodologies give a lower sensitivity and specificity than the use of either of these tools in isolation, and furthermore that their performance is below that of the meta-predictors.
As with all similar studies, we were limited by the availability of novel variants absent from online databases such as gnomAD. The use of under-represented and genetically isolated populations, such as the Amish, allowed for the identification of several novel benign variants and suggests that such populations may be a rich source for future studies. We also identified several both pathogenic and benign variants in a clinical population through a translational research study (DDD). While steps were taken to ensure that the benign variants attained from this group were indeed benign (all variants were present within either monoallelic genes or in biallelic genes in a homozygous state, and were annotated by the referring clinician as having no contribution towards the patient's clinical phenotype), nonetheless it cannot be guaranteed that the variants had no impact of protein function. The study underlines the need for improved data-sharing between clinical laboratories, including both pathogenic and benign rare variants.
This study supports the adoption of in silico meta-predictors for use in variant classification but recommends the use of a single meta-predictor over a consensus-based approach, as recommended by current ACMG guidelines. 1 Each of the tools uses different though heavily overlapping data sources and the feature list used by a tool should be carefully considered before the tool is used. Our results also suggest that tools that use gnomAD data directly may have low specificity when classifying rare or novel variants and that care should be taken when using these tools in conjunction with the ACGS guidelines, as presence in or absence from the gnomAD database is already accounted for in other evidence criteria. Although use of a meta-predictor tool offers advantages over the use of previously available and widely adopted in silico tools, there remain issues to be addressed before they can be used at a level greater than supporting evidence for clinical variant interpretation.