Journal of Molecular Biology
Volume 283, Issue 2, 23 October 1998, Pages 489-506

Regular article
Principles governing amino acid composition of integral membrane proteins: application to topology prediction1

https://doi.org/10.1006/jmbi.1998.2107

Abstract

A new method is suggested here for topology prediction of helical transmembrane proteins. The method is based on the hypothesis that the localizations of the transmembrane segments and the topology are determined by the difference in the amino acid distributions in various structural parts of these proteins rather than by specific amino acid compositions of these parts. A hidden Markov model with special architecture was developed to search for the transmembrane topology corresponding to the maximum likelihood among all the possible topologies of a given protein. The prediction accuracy was tested on 158 proteins and was found to be higher than that found using prediction methods already available. The method successfully predicted all the transmembrane segments in 143 proteins out of the 158, and for 135 of these proteins both the membrane spanning regions and the topologies were predicted correctly. The observed level of accuracy is a strong argument in favor of our hypothesis.

Introduction

Integral membrane proteins play important and functionally diverse roles in living cells. So far, two basic classes are known, according to the structure of the membrane spanning segments. In the first class, all the transmembrane segments form an α-helical structure with lengths of 17 to 25 amino acid residues (von Heijne, 1994). Members of the second class are only known in the bacterial outer porins that have a 16-stranded β-barrel structure (Weiss & Schulz, 1992). While experimental structure determinations of globular proteins by means of X-ray crystallography are becoming more routine (Lattman, 1994), we cannot nurse such hopes for integral membrane proteins, due to the difficulties in crystallization of these proteins, though there are some new encouraging methods in sight (Gouaux, 1998).

However, it is commonly accepted that topology prediction of membrane proteins is easier, and results in higher accuracy, than the prediction of the secondary structure of globular proteins. The number of known sequences is increasing rapidly, resulting in a large gap between that and the number of known structures. Since prediction methods are the most convenient and least expensive ways of determining protein structures, there is a great demand for developing efficient prediction methods. In addition, comparison of prediction methods based on different ideas can help to reveal the principles governing the structure formation of proteins.

The development of methods for predicting transmembrane helices in integral membrane proteins proceeded via several steps. The first approaches were based on hydrophobicity analyses (Kyte & Doolittle, 1982; Eisenberg et al., 1984; Engelman et al., 1986; Cornette et al., 1987; Esposti et al., 1990; Ponnuswamy & Gromiha, 1993; Gromiha & Ponnuswamy, 1995), i.e. they used information only about the amino acids that contributed to the formation of transmembrane helices. Their accuracy could be increased by exploiting information not only from transmembrane segments: namely, by considering the different charge distribution between the inside and outside loops (Boyd et al., 1987; Hartmann et al., 1989; von Heijne, 1992; Sipos & von Heijne, 1993). As the number of experiments dealing with topology increased in the last few years, resulting in more reliable data, several statistical procedures were developed by applying whole amino acid distributions in various structural parts of proteins for the predictions (Jones et al., 1994). Using the advantages of neural network-based algorithms and combining prediction methods with multiple alignments (Persson & Argos, 1994, 1996; Lohmann et al., 1994; Rost et al., 1995, 1996; Casadio et al., 1996), the accuracy of the topology prediction reached the 70 to 80% level, while the accuracy of the prediction of the transmembrane helices reached the 90 to 95% level.

In a previous paper from our group a new method was used for sequence alignment of transmembrane proteins having a very low level of sequence similarity (Cserző et al., 1994). By this method we were able to locate the corresponding transmembrane segments, and the method also gave a high score for all pairs of transmembrane helices, indicating that certain transmembrane characteristics (namely the amino acid composition of these segments) are more relevant than the actual sequence similarity in the alignment. A prediction method based on this observation works well on a set of prokaryotic integral membrane proteins (Cserző et al., 1997). The application of the amino acid composition in distinguishing between the extracellular and intracellular proteins (Nakashima & Nishikawa, 1994) or in defining the folding class of proteins (Chou, 1995) shows that the amino acid composition of proteins contains enough information to predict their structure in “large resolution”.

Studying amino acid similarity in a large database by means of independence divergence calculation indicates that, from the viewpoint of structure formation, amino acids may be classified into slightly different groups than one would expect on the basis of their physico-chemical parameters (Tusnády et al., 1995). Since there is a big difference between the physical environments of the membrane-spanning segments and the cytoplasmic or extracytoplasmic sides of the membrane proteins, it is not surprising that the amino acid compositions of these parts are different. Therefore, it seems reasonable to expect that a more accurate prediction can be developed when the amino acid compositions of these segments are considered instead of using physico-chemical parameters like the hydrophobicity of the amino acids. Since integral membrane proteins have functionally diverse roles in cells and they are in different environments, these facts must be reflected in their amino acid compositions. Thus, enforcing some predetermined or common amino acid compositions of the structural parts of these proteins in topology prediction may produce false results.

Our method is based on the hypothesis that the differences between the amino acid distributions in the various structural parts are the main driving force in the folding of the membrane proteins, i.e. the topology of transmembrane proteins may be determined by the simple fact that the amino acid compositions of the various structural parts show maximum differences, rather than by enforcing specific compositions in these parts. The difference between two distributions can be characterized by the divergence function (Kullback, 1959; Gokhale & Kullback, 1978). Divergence calculation was demonstrated to be a useful tool in sequence database analyses in our earlier work (Tusnády et al., 1995). Here we use the sum of divergence values between the distribution of amino acids of the structural parts and the distribution of residues in the whole protein to measure differences in the amino acid distributions of the structural parts. This sum differs from the log-likelihood only by a constant; therefore, the topology of a membrane protein can be determined if its amino acid sequence can be segmented into parts (e.g. inside, outside and membrane) in such a way that the product of the relative frequencies of the amino acids of these segments along the amino acid sequence is maximal. Using more types of structural parts or enabling some control over the length of the various segments may enhance the power of the method. We can solve this task with the use of a hidden Markov model (HMM).
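The relation between the divergence sum and the log-likelihood can be checked numerically with a minimal sketch. It rests on assumptions that are not taken from the paper: each part's divergence is weighted by the number of residues in that part, the divergence is taken in the Kullback-Leibler form, and the toy sequence, the labelings and the function names (`log_likelihood`, `weighted_divergence_sum`) are purely illustrative. Under these assumptions the two quantities differ by a term that depends only on the composition of the whole sequence, not on how it is segmented.

```python
import math
from collections import Counter


def _counts_by_part(seq, labels):
    """Amino acid counts for each structural part (one label per residue)."""
    parts = {}
    for aa, lab in zip(seq, labels):
        parts.setdefault(lab, Counter())[aa] += 1
    return parts


def log_likelihood(seq, labels):
    """Log of the product, along the sequence, of each residue's relative
    frequency within its own structural part."""
    total = 0.0
    for counts in _counts_by_part(seq, labels).values():
        n = sum(counts.values())
        total += sum(c * math.log(c / n) for c in counts.values())
    return total


def weighted_divergence_sum(seq, labels):
    """Sum over parts of n_part * D(part distribution || whole distribution).
    The length weighting is an assumption made for this sketch."""
    whole = Counter(seq)
    n_all = len(seq)
    total = 0.0
    for counts in _counts_by_part(seq, labels).values():
        n = sum(counts.values())
        for aa, c in counts.items():
            p = c / n                 # frequency within the part
            q = whole[aa] / n_all     # frequency in the whole sequence
            total += n * p * math.log(p / q)
    return total


# Hypothetical toy sequence and two different segmentations of it.
seq = "MKRLLVVLAILGGSDEK"
labels_a = "i" * 3 + "h" * 9 + "o" * 5
labels_b = "i" * 6 + "h" * 6 + "o" * 5

offset_a = weighted_divergence_sum(seq, labels_a) - log_likelihood(seq, labels_a)
offset_b = weighted_divergence_sum(seq, labels_b) - log_likelihood(seq, labels_b)

# The offset depends only on the whole-sequence composition.
constant = -sum(c * math.log(c / len(seq)) for c in Counter(seq).values())
```

Because `offset_a` and `offset_b` coincide with `constant`, ranking segmentations by the weighted divergence sum is equivalent, in this sketch, to ranking them by the log-likelihood.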

HMM is widely used in bioinformatics. The most widespread use of this method is in aligning sequences and generating profiles for protein families (Baldi et al., 1994; Krogh et al., 1994a; Hughey & Krogh, 1996). The profile shows the common sequence motifs of biopolymers (Lawrence & Reilly, 1990) or can be used in database searches to find new sequence homologs for a given family (White et al., 1993; Krogh et al., 1994b; Borodovsky et al., 1995). A special application of this alignment procedure is in protein topology prediction using secondary structure sequences (Francesco et al., 1997). Secondary structure predictions not based on alignment were also developed (Asai et al., 1993; Stultz et al., 1993; White et al., 1993), though their accuracies were modest.

In contrast with other prediction methods HMM can be suited to particular problems. Any actual structural knowledge may be incorporated into the model’s architecture in order to increase its prediction power and to learn more about these proteins. Here, a special HMM is described showing that the maxima of the likelihood function on the space of all possible topologies of a given amino acid sequence correlate with the experimentally established topology. The accuracy of this method was tested in three different data sets. Prediction methods published earlier were compared with our method, to uncover the principles governing the structure formation of integral membrane proteins.
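The idea of searching all possible labelings of a sequence for the most likely one can be illustrated with a small Viterbi sketch. Everything specific in it is hypothetical and chosen only for illustration: the four states (`i`, `o`, and two helix states `h_io` and `h_oi` that force inside and outside loops to alternate), the three residue classes, and all probabilities. It uses fixed parameters and therefore shows only the search step, not the model architecture or the parameter estimation of the method described here.

```python
import math

# Hypothetical toy model: inside loop, outside loop, and two helix states
# so that a helix entered from the inside must exit to the outside.
STATES = ["i", "h_io", "o", "h_oi"]
START = {"i": 0.5, "o": 0.5}
TRANS = {
    "i": {"i": 0.9, "h_io": 0.1},
    "h_io": {"h_io": 0.9, "o": 0.1},
    "o": {"o": 0.9, "h_oi": 0.1},
    "h_oi": {"h_oi": 0.9, "i": 0.1},
}
# Emissions over three coarse residue classes (illustrative values only).
EMIT = {
    "i": {"hyd": 0.2, "pos": 0.3, "other": 0.5},
    "o": {"hyd": 0.2, "pos": 0.1, "other": 0.7},
    "h_io": {"hyd": 0.8, "pos": 0.02, "other": 0.18},
    "h_oi": {"hyd": 0.8, "pos": 0.02, "other": 0.18},
}


def residue_class(aa):
    """Map an amino acid to a coarse class used by the toy emissions."""
    if aa in "AILMFVWC":
        return "hyd"
    if aa in "KR":
        return "pos"
    return "other"


def _log(p):
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path and its log-probability
    for a non-empty sequence under the toy model above."""
    obs = [residue_class(aa) for aa in seq]
    score = {s: _log(START.get(s, 0.0)) + _log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + _log(TRANS[r].get(s, 0.0)))
            new_score[s] = score[prev] + _log(TRANS[prev].get(s, 0.0)) + _log(EMIT[s][o])
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    last = max(STATES, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical sequence: charged start, hydrophobic stretch, polar end.
toy_seq = "KRKK" + "L" * 18 + "SDSE"
toy_path, toy_logp = viterbi(toy_seq)
```

Splitting the helix into two states is one simple way to encode the alternation of sides in the transition structure, which is the kind of structural knowledge that an HMM architecture can carry.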

The hidden Markov model

Investigations of the transmembrane topology of proteins give the impression that transmembrane segments are not located randomly in the sequences. These segments tend to group. To test this hypothesis the length distribution of the segments between transmembrane helices or at the ends of polypeptide chains was checked. In a purely random case the distribution of these segments would be close to geometric distribution as shown in Figure 1, but segments of transmembrane proteins show a different

Conclusion

The accuracy of the prediction method described here indicates that the topology is determined by the maximum divergences of the amino acid distribution of the different structural parts in the membrane proteins rather than by the absolute composition of these parts.

This work is a wide generalization of the work of Jones et al. (1994). Improvements proposed by them are included in HMM automatically; for example, usage of multiple sequence information. The other advantage of HMM is that there is

The hidden Markov model

To apply the hidden Markov model, the model architecture has first to be defined; namely, the number of states, the possible transitions between states and the observation symbols of each state. The model described here consists of five states: loops (inside and outside, I and O, respectively), tails (inside and outside, i and o, respectively) and helices (h). The model is presented in Figure 2; our notation is given in Table 3. For defining the possible transitions between these states: first,

Acknowledgements

We thank Gábor Tusnády for very useful discussion and comments, and Zsuzsanna Dosztányi and Gábor Szirtes for their critical comments on the manuscript. We thank Burkhard Rost for helping with discussion of results from the PHDhtm_ref method. This work was supported by research grants OTKA T017652, F019008 and F022051.

References (57)

  • B. Persson et al.

    Prediction of transmembrane segments in proteins utilising multiple sequence alignments

    J. Mol. Biol.

    (1994)
  • J. van Beilen et al.

    Topology of the membrane-bound alkane hydroxylase of Pseudomonas oleovorans

    J. Biol. Chem.

    (1992)
  • G. von Heijne

    Membrane protein structure prediction

    J. Mol. Biol.

    (1992)
  • M. Weiss et al.

    Structure of porin refined at 1.8 Å resolution

    J. Mol. Biol.

    (1992)
  • R. Yan et al.

    Identification of a residue in the translocation pathway of a membrane carrier

    Cell

    (1993)
  • S.F. Altschul et al.

    Basic local alignment search tool

    J. Mol. Biol.

    (1990)
  • K. Asai et al.

    Prediction of protein secondary structure by the hidden Markov model

    Comp. Appl. Biosci.

    (1993)
  • A. Bairoch et al.

    The SWISS-PROT protein sequence data bank

    Nucl. Acids Res.

    (1991)
  • P. Baldi et al.

    Hidden Markov models of biological primary sequence information

    Proc. Natl Acad. Sci. USA

    (1994)
  • L. Bergelson et al.

    Topological asymmetry of phospholipids in membranes

    Science

    (1977)
  • M. Borodovsky et al.

    Detection of new genes in a bacterial genome using Markov models for three gene classes

    Nucl. Acids Res.

    (1995)
  • D. Boyd et al.

    Determinants of membrane protein topology

    Proc. Natl Acad. Sci. USA

    (1987)
  • M. Brown et al.

    Using Dirichlet mixture priors to derive hidden Markov models for protein families

  • R. Casadio et al.

    A predictor of transmembrane alpha-helix domains of proteins based on neural networks

    Eur. Biophys. J.

    (1996)
  • G.-Q. Chen et al.

    Reduction of membrane protein hydrophobicity by site-directed mutagenesis: introduction of multiple polar residues in helix D of bacteriorhodopsin

    Protein Eng.

    (1997)
  • K.C. Chou

    A novel approach to predicting protein structural classes in a (20–1)-d amino acid composition space

    Proteins: Struct. Funct. Genet.

    (1995)
  • M. Cserző et al.

    Prediction of transmembrane alpha-helices in prokariotic membrane proteins: the dense alignment surface method

    Protein Eng.

    (1997)
  • T.J. Dueweke et al.

    Proteolysis of the cytochrome d complex with trypsin localizes a quinol oxidase domain

    Biochemistry

    (1991)
    1

    Edited by J. Thornton
