Journal of Molecular Biology
Volume 322, Issue 4, 27 September 2002, Pages 891-901

Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction

https://doi.org/10.1016/S0022-2836(02)00813-6

Abstract

Methods for automated prediction of deleterious protein mutations have utilized both structural and evolutionary information, but the relative contribution of these two factors remains unclear. To address this, we have used a variety of structural and evolutionary features to create simple deleterious mutation models that have been tested on both experimental mutagenesis and human allele data. We find that the most accurate predictions are obtained using two features: a solvent-accessibility term, the Cβ density, and a score derived from homologous sequences, SIFT. A classification tree using these two features has a cross-validated prediction error of 20.5% on an experimental mutagenesis test set when the prior probability for deleterious and neutral cases is equal, whereas this prediction error is 28.8% using the Cβ density alone and 22.2% using SIFT alone. The improvement imparted by structure increases when fewer homologs are available: when restricted to three homologs, the prediction error improves from 26.9% using SIFT alone to 22.4% using SIFT and the Cβ density, or 24.8% using SIFT and a noisy Cβ density term approximating the inaccuracy of ab initio structures modeled by the Rosetta method. We conclude that methods for deleterious mutation prediction should include structural information when fewer than five to ten homologs are available, and that ab initio predicted structures may soon be useful in such cases when high-resolution structures are unavailable.
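The prediction errors quoted above are stated for equal prior probabilities of the deleterious and neutral classes. One way to read such a figure is as the mean of the two per-class error rates, which does not depend on how many examples of each class the test set happens to contain. The sketch below illustrates that reading only; the function name `balanced_error` and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def balanced_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of the per-class error rates.

    Under equal class priors this is the expected misclassification rate,
    regardless of the class proportions in the test set.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    rates = []
    for cls in sorted(set(y_true)):
        # Predictions made for the examples whose true label is `cls`.
        members = [p for t, p in zip(y_true, y_pred) if t == cls]
        wrong = sum(1 for p in members if p != cls)
        rates.append(wrong / len(members))
    return sum(rates) / len(rates)


# Hypothetical toy data: 1 = deleterious, 0 = neutral.
truth = [1, 1, 1, 1, 0, 0]
guess = [1, 1, 1, 0, 0, 1]
print(balanced_error(truth, guess))
```

In this toy case the raw error is 2/6, while the balanced error averages a 25% error on the deleterious class with a 50% error on the neutral class.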

Introduction

The rapid discovery of single nucleotide polymorphisms (SNPs) in the human genome has created an opportunity for high-throughput deleterious mutation prediction methods to discover and prioritize candidate human disease alleles from the pool of uncharacterized non-synonymous coding SNPs. Several methods have recently been developed specifically for this purpose,1,2,3 which utilize information from both evolution and protein structure. In principle, a protein mutation could be deleterious either because it destabilizes structure or because it disrupts a functional site involved in catalysis, ligand-binding, or interaction with another protein. For this reason, one might expect evolutionary and structural information from proteins to complement one another: structural information should help to identify destabilizing mutations, while highly conserved positions in multiple sequence alignments (MSAs) can help to identify functional sites.4 While this principle is qualitatively clear, the relative importance and complementarity of these feature domains has not been well characterized with regard to computational prediction of intolerant mutations. Such characterization is important both for improving the accuracy of predictive models and for determining the relative error of predictions made when sequence or structural information is reduced or absent, as is often the case when few sequence homologs exist or only a coarse approximation to the protein structure is available.

In this study, we investigate the relative strength of a variety of structural and evolutionary features described in previous work and examine how the most effective features complement one another in simple classification models. We characterize the performance of classifiers as a function of the number of homologs available for the calculation of evolutionary features, which indicates how the reliability of deleterious mutation prediction is affected by the common problem of having few homologous sequences available. Finally, we explore the possibility of improving classification for cases where no experimental structure is available, by using structures generated from ab initio prediction methods.

Mutation test sets

The predictive power of individual structure and sequence-based features was studied using two sets of mutation data compiled to test the methods developed in this study (Table 1). The first of these test sets consists of data from laboratory mutagenesis experiments. While there is a significant amount of targeted mutagenesis data available, such data often tend to be biased towards particular structural sites, residue types, or other features of interest to experimentalists. Due to this biased

Discussion

Using a simple non-linear classification model validated on unbiased laboratory-assayed mutations, the lowest classification error found in this study was achieved using only two features: the Cβ density and the SIFT score, yielding a balanced classification error of 20.5%. These features used alone were, respectively, the most accurate structure and sequence-based metrics considered for classification of laboratory-assayed mutations in this study, with the Cβ density having a classification

Blosum62

The Blosum62 feature for each mutation utilized the log-likelihood ratios of tolerated residue substitution relative to random residue alignment, as calculated by the BLOSUM method10 with sequence pairs above 62% identity clustered to reduce amino acid pair counts from recently diverged sequence homologs.
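For reference, the log-likelihood ratio behind a BLOSUM score can be written out as below. This follows the standard published definition of the BLOSUM method rather than anything specific to this study; the symbols $q_{ij}$, $p_i$, $e_{ij}$, and $s_{ij}$ are notation introduced here for illustration.

```latex
% q_{ij}: observed frequency of the unordered residue pair (i, j) in the
%         clustered alignment blocks, with \sum_{i \le j} q_{ij} = 1
% p_i:    background frequency of residue i implied by the pair frequencies
% e_{ij}: frequency of the pair (i, j) expected under random alignment
\begin{align}
  p_i    &= q_{ii} + \frac{1}{2} \sum_{j \neq i} q_{ij} \\
  e_{ij} &= \begin{cases} p_i^2 & i = j \\ 2\, p_i\, p_j & i \neq j \end{cases} \\
  s_{ij} &= \log_2 \frac{q_{ij}}{e_{ij}}
\end{align}
% The published BLOSUM62 entries are 2 s_{ij} rounded to the nearest
% integer, i.e. scores in half-bit units.
```

A positive $s_{ij}$ means the substitution is observed more often than chance would predict, and a negative value means it is observed less often.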

Normalized site entropy

The normalized site entropy was calculated using sequence alignments constructed by SIFT.1 For each site in the alignment, the entropy was calculated for the probability distribution of all
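The description above is cut off before the normalization is stated, so the following is only a sketch under explicit assumptions: it takes the Shannon entropy of the observed residue frequencies in a single alignment column and divides by the maximum possible entropy, ln 20, so that values fall between 0 (fully conserved) and 1 (all 20 residues equally frequent). The function name, the handling of gaps, and the use of raw frequencies without pseudocounts or sequence weighting are all assumptions and may differ from the calculation used in the study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def normalized_site_entropy(column: str) -> float:
    """Shannon entropy of one alignment column, scaled to the range [0, 1].

    Assumption: gaps and non-standard symbols are ignored, and residue
    probabilities are raw observed frequencies.
    """
    residues = [r for r in column.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("column contains no standard residues")
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    # Divide by the entropy of a uniform distribution over 20 residues.
    return entropy / math.log(len(AMINO_ACIDS))


# Hypothetical columns: one fully conserved, one split between two residues.
print(normalized_site_entropy("LLLLLL"))
print(normalized_site_entropy("LLL-VVV"))
```

A low value flags a conserved site where substitutions are more likely to be deleterious; a value near 1 indicates a site that tolerates many residue types.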

Acknowledgements

We thank Pauline Ng and Steven Henikoff for providing mutagenesis data, access to the SIFT source code, and for helpful discussions regarding this problem. We thank Tanja Kortemme for providing side-chain entropy calculations for several proteins. This work was supported by the HHMI.

References (25)

  • A. Bairoch et al. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res. (2000)
  • S. Henikoff et al. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. (1992)