Evaluation of Structural and Evolutionary Contributions to Deleterious Mutation Prediction

doi:10.1016/S0022-2836(02)00813-6

Journal of Molecular Biology

Volume 322, Issue 4, 27 September 2002, Pages 891-901

https://doi.org/10.1016/S0022-2836(02)00813-6

Abstract

Methods for automated prediction of deleterious protein mutations have utilized both structural and evolutionary information, but the relative contribution of these two factors remains unclear. To address this, we have used a variety of structural and evolutionary features to create simple deleterious mutation models that have been tested on both experimental mutagenesis and human allele data. We find that the most accurate predictions are obtained using a solvent-accessibility term, the C^β density, and a score derived from homologous sequences, SIFT. A classification tree using these two features has a cross-validated prediction error of 20.5% on an experimental mutagenesis test set when the prior probability for deleterious and neutral cases is equal, whereas the prediction error is 28.8% using the C^β density alone and 22.2% using SIFT alone. The improvement imparted by structure increases when fewer homologs are available: when restricted to three homologs, the prediction error improves from 26.9% using SIFT alone to 22.4% using SIFT and the C^β density, or to 24.8% using SIFT and a noisy C^β density term approximating the inaccuracy of ab initio structures modeled by the Rosetta method. We conclude that methods for deleterious mutation prediction should include structural information when fewer than five to ten homologs are available, and that ab initio predicted structures may soon be useful in such cases when high-resolution structures are unavailable.

Introduction

The rapid discovery of single nucleotide polymorphisms (SNPs) in the human genome has created an opportunity for high-throughput deleterious mutation prediction methods to discover and prioritize candidate human disease alleles from the pool of uncharacterized non-synonymous coding SNPs. Several methods that utilize information from both evolution and protein structure have recently been developed specifically for this purpose.¹⁻³ In principle, a protein mutation could be deleterious either because it destabilizes the structure or because it disrupts a functional site involved in catalysis, ligand binding, or interaction with another protein. For this reason, one might expect evolutionary and structural information from proteins to complement one another: structural information should help to identify destabilizing mutations, while highly conserved positions in multiple sequence alignments (MSAs) can help to identify functional sites.⁴ While this principle is qualitatively clear, the relative importance and complementarity of these feature domains have not been well characterized with regard to computational prediction of deleterious mutations. Such characterization is important both for improving the accuracy of predictive models and for determining the relative error of predictions made when sequence or structural information is reduced or absent, as is often the case when few sequence homologs exist or only a coarse approximation to the protein structure is available.

In this study, we investigate the relative strength of a variety of structural and evolutionary features described in previous work and examine how the most effective features complement one another in simple classification models. We characterize the performance of classifiers as a function of the number of homologs available for the calculation of evolutionary features, which indicates how the reliability of deleterious mutation prediction is affected by the common problem of having few homologous sequences available. Finally, we explore the possibility of improving classification for cases where no experimental structure is available, by using structures generated from ab initio prediction methods.

Section snippets

Mutation test sets

The predictive power of individual structure and sequence-based features was studied using two sets of mutation data compiled to test the methods developed in this study (Table 1). The first of these test sets consists of data from laboratory mutagenesis experiments. While there is a significant amount of targeted mutagenesis data available, such data often tend to be biased towards particular structural sites, residue types, or other features of interest to experimentalists. Due to this biased

Discussion

Using a simple non-linear classification model validated on unbiased laboratory-assayed mutations, the lowest classification error found in this study was achieved using only two features: the C^β density and the SIFT score, yielding a balanced classification error of 20.5%. These features used alone were, respectively, the most accurate structure and sequence-based metrics considered for classification of laboratory-assayed mutations in this study, with the C^β density having a classification
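As an illustration of the error measure quoted above, the following minimal sketch computes a classification error under equal class priors, taken here to be the mean of the per-class error rates. This reading of "balanced" is an assumption based on the abstract's wording; the function name and the toy labels are hypothetical and are not data from this study.

```python
def balanced_error(y_true, y_pred):
    """Mean of the per-class error rates (error under equal class priors)."""
    classes = sorted(set(y_true))
    rates = []
    for c in classes:
        # indices of all cases whose true label is class c
        idx = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(1 for i in idx if y_pred[i] != c)
        rates.append(wrong / len(idx))
    return sum(rates) / len(rates)


# Toy labels (hypothetical): 1 = deleterious, 0 = neutral.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

error = balanced_error(y_true, y_pred)
raw_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
```

With unequal class sizes the two measures differ: here one of four deleterious cases and one of eight neutral cases are misclassified, so the balanced error weights the smaller class more heavily than the raw error does.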

Blosum62

The Blosum62 feature for each mutation utilized the log-likelihood ratios of tolerated residue substitution relative to random residue alignment, as calculated by the BLOSUM method¹⁰ with sequence pairs above 62% identity clustered to reduce amino acid pair counts from recently diverged sequence homologs.
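The log-likelihood ratio underlying a BLOSUM-style score can be sketched as follows, assuming the standard formulation in which observed pair frequencies are compared with the frequencies expected from random pairing and the result is expressed in rounded half-bit units. The two-letter alphabet and its frequencies are hypothetical toy values, the function name is illustrative, and the clustering of sequences above 62% identity is not reproduced here.

```python
import math


def blosum_style_scores(pair_freq):
    """Rounded half-bit log-odds scores from unordered pair frequencies.

    pair_freq maps a sorted residue pair (a, b) to its observed
    frequency q_ab; the frequencies are assumed to sum to 1.
    """
    # Marginal frequency of each residue: an off-diagonal pair
    # contributes half of its frequency to each of its two residues.
    p = {}
    for (a, b), q in pair_freq.items():
        if a == b:
            p[a] = p.get(a, 0.0) + q
        else:
            p[a] = p.get(a, 0.0) + q / 2
            p[b] = p.get(b, 0.0) + q / 2

    scores = {}
    for (a, b), q in pair_freq.items():
        # Expected frequency of the unordered pair under random pairing.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(q / expected))
    return scores


# Hypothetical two-residue alphabet, for illustration only.
toy = {("A", "A"): 0.5, ("A", "B"): 0.2, ("B", "B"): 0.3}
scores = blosum_style_scores(toy)
```

In this toy case the identical pairs receive positive scores and the A/B substitution receives a negative one, reflecting a substitution observed less often than random alignment would predict.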

Normalized site entropy

The normalized site entropy was calculated using sequence alignments constructed by SIFT.¹ For each site in the alignment, the entropy was calculated for the probability distribution of all
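A per-site entropy of this kind can be sketched as below. Because the description above is truncated, the details are assumptions: the sketch uses the Shannon entropy of the observed residue frequencies in one alignment column, ignores gap characters, and normalizes by the maximum entropy for a 20-letter alphabet. The function name and example columns are hypothetical, and the SIFT alignment procedure itself is not reproduced.

```python
import math
from collections import Counter


def normalized_site_entropy(column, alphabet_size=20):
    """Shannon entropy of one alignment column, scaled to [0, 1].

    Gaps ('-') are skipped; dividing by log2(alphabet_size) is an
    assumed normalization, not one confirmed by the source.
    """
    residues = [r for r in column if r != "-"]
    n = len(residues)
    if n == 0:
        return 0.0  # no residues observed at this site
    counts = Counter(residues)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(alphabet_size)


conserved = normalized_site_entropy("AAAA")  # fully conserved site
variable = normalized_site_entropy("AAGG")   # two residues, equally frequent
```

Under these assumptions a fully conserved site scores zero and the score grows as more residue types are observed at the site.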

Acknowledgements

We thank Pauline Ng and Steven Henikoff for providing mutagenesis data, access to the SIFT source code, and for helpful discussions regarding this problem. We thank Tanja Kortemme for providing side-chain entropy calculations for several proteins. This work was supported by the HHMI.

References (25)

O. Lichtarge et al. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol. (1996)
P. Markiewicz et al. Genetic studies of the lac repressor XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as spacers which do not require a specific sequence. J. Mol. Biol. (1994)
D. Chasman et al. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. (2001)
K.T. Simons et al. Prospects for ab initio protein structural genomics. J. Mol. Biol. (2001)
A.R. Panchenko et al. Combination of threading potentials and sequence profiles improves fold recognition. J. Mol. Biol. (2000)
P.C. Ng et al. Predicting deleterious amino acid substitutions. Genome Res. (2001)
S. Sunyaev et al. Prediction of deleterious human alleles. Hum. Mol. Genet. (2001)
Z. Wang et al. SNP's, protein structure, and disease. Hum. Mut. (2001)
D.D. Loeb et al. Complete mutagenesis of the HIV-1 protease. Nature (1989)
D. Rennell et al. Systematic mutation of bacteriophage T4 lysozyme. J. Mol. Biol. (1991)
A. Bairoch et al. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res. (2000)
S. Henikoff et al. Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. (1992)

