Journal of Molecular Biology
Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction
Introduction
The rapid discovery of single nucleotide polymorphisms (SNPs) in the human genome has created an opportunity for high-throughput deleterious mutation prediction methods to discover and prioritize candidate human disease alleles from the pool of uncharacterized non-synonymous coding SNPs. Several methods that utilize information from both evolution and protein structure have recently been developed specifically for this purpose.1, 2, 3 In principle, a protein mutation could be deleterious either because it destabilizes the structure or because it disrupts a functional site involved in catalysis, ligand binding, or interaction with another protein. For this reason, one might expect evolutionary and structural information from proteins to complement one another: structural information should help to identify destabilizing mutations, while highly conserved positions in multiple sequence alignments (MSAs) can help to identify functional sites.4 While this principle is qualitatively clear, the relative importance and complementarity of these feature domains have not been well characterized with regard to computational prediction of intolerant mutations. Such characterization is important both for improving the accuracy of predictive models and for determining the relative error of predictions made when sequence or structural information is reduced or absent, as is often the case when few sequence homologs exist or only a coarse approximation to the protein structure is available.
In this study, we investigate the relative strength of a variety of structural and evolutionary features described in previous work and examine how the most effective features complement one another in simple classification models. We characterize the performance of classifiers as a function of the number of homologs available for the calculation of evolutionary features, which indicates how the reliability of deleterious mutation prediction is affected by the common problem of having few homologous sequences available. Finally, we explore the possibility of improving classification for cases where no experimental structure is available, by using structures generated from ab initio prediction methods.
Section snippets
Mutation test sets
The predictive power of individual structure- and sequence-based features was studied using two sets of mutation data compiled to test the methods developed in this study (Table 1). The first of these test sets consists of data from laboratory mutagenesis experiments. While a significant amount of targeted mutagenesis data is available, such data tend to be biased towards particular structural sites, residue types, or other features of interest to experimentalists. Due to this biased
Discussion
Using a simple non-linear classification model validated on unbiased laboratory-assayed mutations, the lowest classification error found in this study was achieved using only two features, the Cβ density and the SIFT score, which together yielded a balanced classification error of 20.5%. Used alone, these features were, respectively, the most accurate structure-based and sequence-based metrics considered for classification of laboratory-assayed mutations in this study, with the Cβ density having a classification
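The snippet above does not define the balanced classification error. As a minimal sketch, assuming it means the average of the false-negative and false-positive rates (so that each class counts equally regardless of its size), it could be computed as follows. The function name and the 1/0 label encoding are illustrative choices, not taken from the paper.

```python
def balanced_error(y_true, y_pred):
    """Mean of the two per-class error rates.

    Assumption: labels are 1 (deleterious) and 0 (tolerated).
    """
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Fraction of deleterious mutations predicted as tolerated.
    fn_rate = sum(1 for p in pos if p == 0) / len(pos)
    # Fraction of tolerated mutations predicted as deleterious.
    fp_rate = sum(1 for p in neg if p == 1) / len(neg)
    return 0.5 * (fn_rate + fp_rate)
```

Under this definition a classifier that labels every mutation as tolerated scores 50%, which is the baseline against which a figure such as 20.5% would be read.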
Blosum62
The Blosum62 feature for each mutation used the log-likelihood ratio of a tolerated residue substitution relative to random residue alignment, as calculated by the BLOSUM method,10 in which sequences above 62% identity are clustered to reduce the amino acid pair counts contributed by recently diverged sequence homologs.
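To make the log-likelihood ratio concrete, the sketch below computes BLOSUM-style scores from a table of aligned residue pair counts, using the usual half-bit scaling (twice the base-2 logarithm, rounded). The pair counts in the usage example are invented toy values, and the 62% identity clustering step is not reproduced; this illustrates the scoring formula only and is not a reimplementation of the published matrix.

```python
import math

def log_odds_scores(pair_counts):
    """BLOSUM-style log-odds scores from unordered residue pair counts.

    pair_counts maps (a, b) with a <= b to the number of times residues
    a and b were observed aligned to one another (hypothetical input).
    """
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    p = {}
    for (a, b), f in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + f
        else:
            p[a] = p.get(a, 0.0) + f / 2
            p[b] = p.get(b, 0.0) + f / 2

    scores = {}
    for (a, b), f in q.items():
        # Frequency expected if residues were aligned at random.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(f / expected))
    return scores

# Toy two-letter alphabet: identities are over-represented, so they score
# positive, while the A/V substitution scores negative.
toy = {("A", "A"): 40, ("A", "V"): 20, ("V", "V"): 40}
scores = log_odds_scores(toy)
```

A positive score means the pair is seen more often than chance would predict (a tolerated substitution); a negative score means it is seen less often.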
Normalized site entropy
The normalized site entropy was calculated using sequence alignments constructed by SIFT.1 For each site in the alignment, the entropy was calculated for the probability distribution of all
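The snippet is cut off before the normalization is stated. As a hedged sketch, assuming the measure is the Shannon entropy of the residue frequencies in one alignment column divided by its maximum possible value, log(20), a calculation could look like the following. Ignoring gap characters and the choice of 20 as the alphabet size are assumptions made here, not details from the paper.

```python
import math
from collections import Counter

GAP_CHARS = {"-", "."}  # assumption: gaps are excluded from the distribution

def normalized_site_entropy(column, alphabet_size=20):
    """Shannon entropy of one alignment column, scaled to the range [0, 1]."""
    residues = [r for r in column if r not in GAP_CHARS]
    if not residues:
        raise ValueError("column contains no residues")
    n = len(residues)
    counts = Counter(residues)
    entropy = sum(-(c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(alphabet_size)
```

With this scaling a perfectly conserved column gives 0 and a column with all 20 amino acids equally represented gives 1, so low values flag positions where a substitution is more likely to be deleterious.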
Acknowledgements
We thank Pauline Ng and Steven Henikoff for providing mutagenesis data, access to the SIFT source code, and for helpful discussions regarding this problem. We thank Tanja Kortemme for providing side-chain entropy calculations for several proteins. This work was supported by the HHMI.
References (25)
- et al. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. (1996)
- et al. Genetic studies of the lac repressor XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as spacers which do not require a specific sequence. J. Mol. Biol. (1994)
- et al. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. (2001)
- et al. Prospects for ab initio protein structural genomics. J. Mol. Biol. (2001)
- et al. Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol. (2000)
- et al. Predicting deleterious amino acid substitutions. Genome Res. (2001)
- et al. Prediction of deleterious human alleles. Hum. Mol. Genet. (2001)
- et al. SNPs, protein structure, and disease. Hum. Mut. (2001)
- et al. Complete mutagenesis of the HIV-1 protease. Nature (1989)
- et al. Systematic mutation of bacteriophage T4 lysozyme. J. Mol. Biol. (1991)
- The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res.
- Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci.