Improving the clinical interpretation of missense variants in X linked genes using structural analysis

Background: Improving the clinical interpretation of missense variants can increase the diagnostic yield of genomic testing and lead to personalised management strategies. Currently, due to the imprecision of bioinformatic tools that aim to predict variant pathogenicity, their role in clinical guidelines remains limited. There is a clear need for more accurate prediction algorithms and this study aims to improve performance by harnessing structural biology insights. The focus of this work is missense variants in a subset of genes associated with X linked disorders.

Methods: We have developed a protein-specific variant interpreter (ProSper) that combines genetic and protein structural data. This algorithm predicts missense variant pathogenicity by applying machine learning approaches to the sequence and structural characteristics of variants.

Results: ProSper outperformed seven previously described tools, including meta-predictors, in correctly evaluating whether or not variants are pathogenic; this was the case for 11 of the 21 genes associated with X linked disorders that met the inclusion criteria for this study. We also determined gene-specific pathogenicity thresholds that improved the performance of VEST4, REVEL and ClinPred, the three best-performing tools out of the seven that were evaluated; this was the case in 11, 11 and 12 different genes, respectively.

Conclusion: ProSper can form the basis of a molecule-specific prediction tool that can be implemented into diagnostic strategies. It can allow the accurate prioritisation of missense variants associated with X linked disorders, aiding precise and timely diagnosis. In addition, we demonstrate that gene-specific pathogenicity thresholds for a range of missense prioritisation tools can lead to an increase in prediction accuracy.

interactions leading to structural instability. Similarly, introduction of a hydrophobic residue on the surface of a protein can make protein aggregation more likely.
v) Side-chain solvent accessibility. Residues can be hidden in the core of the protein or be solvent accessible on the surface. Measuring the solvent accessibility of each residue allows the residues to be categorized on a scale from completely hidden to fully accessible on the surface. In most of the proteins, the dataset P variants were found towards the core of the protein rather than on the surface.
vi) Conservation at variant's site. The loss of a conserved residue is more likely to be detrimental to the structure and function of the protein. This is particularly so when the change is less conservative, i.e. when a residue is replaced by another with very different physicochemical properties. The dataset P variants are expected to result in the loss of more conserved residues compared to the dataset N variants.
vii) Alteration in residues with special physicochemical properties. Variants involving glycine, proline, and cysteine were considered more likely to affect protein structure. Glycine is the smallest of the residues and replacing it in the core of a protein with any of the larger residues can create stress on the structure and destabilize the protein. The introduction of glycine into the core of a protein to replace one of the larger residues can also result in instability. Similarly, replacing proline with most other residues, or vice versa, is more likely to be destabilizing due to its unique ring-shaped structure. Proline can introduce a turn in the structure, such as in tight turns, which other residues cannot replicate. The loss or gain of cysteine was also considered in cases where surface/extracellular variants are likely to lead to the loss or formation of disulphide bonds, respectively, causing protein instability.
viii) Disease propensity at the site of variant. Residues previously associated with the disease are more likely to be pathogenic when mutated. This is equivalent to the residue having mutated more than once in the dataset for each protein.
ix) The secondary structure of the site of variant. Beta strands or alpha helices can be less tolerant to certain changes compared to loops. The introduction of a proline into a beta strand, for instance, is likely to break the strand and disturb the hydrogen bonds forming a beta sheet, whereas the same change may be better tolerated in an alpha helix.

x) Effect on protein-protein interaction and protein stability. The likelihood of the site of the variant being involved in protein-protein interaction was predicted using an external tool. The effect of the variants on protein stability was also predicted. More of the disease-implicated variants are likely to destabilize the protein structure or be involved in interactions.
xi) Disordered regions. Disordered regions are more flexible in nature and promiscuous in their ability to bind proteins. Variants in these finely tuned regions can easily disturb them, resulting in more sensitive and less specific, i.e. non-functional, binding, which can lead to binding disruption or aggregation.
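As an illustration only, a few of the features listed above could be encoded along the following lines. This is a minimal sketch and not the ProSper implementation: the `Variant` type, the function names, and in particular the accessibility cutoffs (0.05 and 0.25) are assumptions introduced here, since the text only states that residues are categorized from completely hidden to fully accessible.

```python
from collections import Counter
from dataclasses import dataclass

# Glycine, proline and cysteine, the residues singled out in feature vii).
SPECIAL_RESIDUES = {"G", "P", "C"}


@dataclass(frozen=True)
class Variant:
    position: int  # 1-based residue number in the protein
    ref: str       # wild-type residue, one-letter code
    alt: str       # substituted residue, one-letter code


def accessibility_category(rsa: float) -> str:
    """Bin a relative solvent accessibility in [0, 1].

    The cutoffs are hypothetical and chosen only for illustration.
    """
    if not 0.0 <= rsa <= 1.0:
        raise ValueError("rsa must lie in [0, 1]")
    if rsa < 0.05:
        return "buried"
    if rsa < 0.25:
        return "intermediate"
    return "exposed"


def involves_special_residue(variant: Variant) -> bool:
    """True if the variant loses or gains glycine, proline or cysteine."""
    return variant.ref in SPECIAL_RESIDUES or variant.alt in SPECIAL_RESIDUES


def recurrent_sites(disease_variants: list[Variant]) -> set[int]:
    """Positions mutated more than once in the disease dataset of one protein."""
    counts = Counter(v.position for v in disease_variants)
    return {position for position, n in counts.items() if n > 1}
```

In this sketch, a variant at a position returned by `recurrent_sites` would be flagged for feature viii), the disease propensity at the site of the variant.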

Variant features considered in certain genes (gene-specific features)
i) Variant clustering. This was considered in the proteins where variant clustering was noted in dataset P in comparison to dataset B through visual inspection. The protein was either divided into halves by drawing a plane through its centre of mass or separated into multiple protein/functional domains. Variant clustering based on the secondary structure was also considered.
ii) Functional site. Variations at functional sites, including binding sites, are more likely to affect protein function.

Table S9b. A comparison of the Matthews Correlation Coefficient (MCC) scores with the optimized MCC for the performance of REVEL, VEST4, and ClinPred using all of the datasets. The optimized MCC was generated using gene- or protein-specific pathogenicity thresholds. The gene-specific threshold was identified using 80% of the predictions from VEST4, REVEL, and ClinPred through repeated (n=10) 5-fold cross-validation with random subsampling. The optimized MCC score was generated using the rest (20%) of the predictions from each tool at the threshold identified for each gene. The original MCC score was generated using the suggested and widely used threshold of 0.

Table S10a. The gene- or protein-specific pathogenicity thresholds identified in 21 genes for three prediction tools using a subset of each of the datasets. For each gene, the dataset was balanced using undersampling, i.e. using a random subset from the majority class to match the number of variants in the minority class. The gene-specific threshold was identified using 80% of the predictions from VEST4, REVEL, and ClinPred through repeated (n=10)

Table S10b. A comparison of the Matthews Correlation Coefficient (MCC) scores with the optimized MCC scores for the performance of REVEL, VEST4, and ClinPred using a subset of each dataset. For each gene, the dataset was balanced using undersampling, i.e. using a random subset from the majority class to match the number of variants in the minority class. The optimized MCC was generated using gene- or protein-specific pathogenicity thresholds. The gene-specific threshold was identified using 80% of the predictions from VEST4, REVEL, and ClinPred through repeated (n=10) 5-fold cross-validation with random subsampling. The optimized MCC score was generated using the rest (20%) of the predictions from each tool at the threshold identified for each gene. The original MCC score was generated using the suggested and widely used threshold of 0.5. The improvement in prediction performance, i.e. a higher optimized MCC score compared to the original, is highlighted in bold.
The VEST4 predictions were unavailable for ALAS2 and NDP variants in the respective transcripts of interest.
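
The threshold procedure described in these captions can be sketched as follows. This is a hedged illustration and not the code used in the study. The captions specify an 80/20 split, repeated (n=10) 5-fold cross-validation, undersampling of the majority class and a default threshold of 0.5. They do not state how thresholds are combined across folds, so taking the median of the per-fold best thresholds is an assumption made here, as are the candidate grid (0.01 to 0.99) and all function names.

```python
import math
import random
import statistics


def mcc(labels, scores, threshold):
    """MCC when scores at or above `threshold` are called pathogenic."""
    tp = fp = tn = fn = 0
    for is_pathogenic, score in zip(labels, scores):
        called = score >= threshold
        if called and is_pathogenic:
            tp += 1
        elif called:
            fp += 1
        elif is_pathogenic:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def undersample(labels, scores, rng):
    """Balance the classes by randomly subsampling the majority class."""
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    n = min(len(pos), len(neg))
    keep = rng.sample(pos, n) + rng.sample(neg, n)
    return [labels[i] for i in keep], [scores[i] for i in keep]


def gene_specific_threshold(labels, scores, repeats=10, folds=5, seed=0):
    """Select a threshold on 80% of the predictions, evaluate on the other 20%."""
    rng = random.Random(seed)
    grid = [i / 100 for i in range(1, 100)]  # hypothetical candidate thresholds
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]

    picks, cv_scores = [], []
    for _ in range(repeats):
        rng.shuffle(train)  # new random fold assignment for each repeat
        for k in range(folds):
            fit = [i for j, i in enumerate(train) if j % folds != k]
            held_out = train[k::folds]
            fit_y = [labels[i] for i in fit]
            fit_s = [scores[i] for i in fit]
            # Best threshold on the fitting folds (first one wins on ties).
            best = max(grid, key=lambda t: mcc(fit_y, fit_s, t))
            picks.append(best)
            cv_scores.append(
                mcc([labels[i] for i in held_out], [scores[i] for i in held_out], best)
            )

    threshold = statistics.median(picks)  # assumption: aggregate by median
    test_y = [labels[i] for i in test]
    test_s = [scores[i] for i in test]
    return {
        "threshold": threshold,
        "cv_mcc": statistics.mean(cv_scores),
        "original_mcc": mcc(test_y, test_s, 0.5),
        "optimized_mcc": mcc(test_y, test_s, threshold),
    }
```

For the balanced comparison (Tables S10a and S10b), `undersample` would be applied to the labels and scores of a gene before calling `gene_specific_threshold`. A gene would then count as improved when `optimized_mcc` exceeds `original_mcc`.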