Journal of Molecular Biology
Volume 302, Issue 1, 8 September 2000, Pages 205-217
Journal home page for Journal of Molecular Biology

Regular article
T-coffee: a novel method for fast and accurate multiple sequence alignment1

https://doi.org/10.1006/jmbi.2000.4042Get rights and content

Abstract

We describe a new method (T-Coffee) for multiple sequence alignment that provides a dramatic improvement in accuracy with a modest sacrifice in speed as compared to the most commonly used alternatives. The method is broadly based on the popular progressive approach to multiple alignment but avoids the most serious pitfalls caused by the greedy nature of this algorithm. With T-Coffee we pre-process a data set of all pair-wise alignments between the sequences. This provides us with a library of alignment information that can be used to guide the progressive alignment. Intermediate alignments are then based not only on the sequences to be aligned next but also on how all of the sequences align with each other. This alignment information can be derived from heterogeneous sources such as a mixture of alignment programs and/or structure superposition. Here, we illustrate the power of the approach by using a combination of local and global pair-wise alignments to generate the library. The resulting alignments are significantly more reliable, as determined by comparison with a set of 141 test cases, than any of the popular alternatives that we tried. The improvement, especially clear with the more difficult test cases, is always visible, regardless of the phylogenetic spread of the sequences in the tests.

Introduction

The simultaneous alignment of three or more nucleotide or amino acid sequences is one of the commonest tasks in bioinformatics. Multiple alignments are an essential pre-requisite to many further analyses of protein families such as homology modeling or phylogenetic reconstruction, or are simply used to illustrate conserved and variable sites within a family. These alignments may be further used to derive profiles (Gribskov et al., 1987) or hidden Markov models Bucher et al 1996, Haussler et al 1993 that can be used to scour databases for distantly related members of the family.

The automatic generation of an accurate multiple alignment is potentially a daunting task. Ideally, one would make use of an in-depth knowledge of the evolutionary and structural relationships within the family, but this information is often lacking or difficult to use. General empirical models of protein evolution Benner et al 1992, Dayhoff 1978, Henikoff and Henikoff 1992 are widely used instead, but these can be difficult to apply when the sequences are less than 30 % identical (Sander & Schneider, 1991). Further, mathematically sound methods for carrying out alignments, using these models, can be extremely demanding in computer resources for more than a handful of sequences Carrillo and Lipman 1988, Wang and Jiang 1994. In practice, heuristic methods are used for all but the smallest data sets.

The most commonly used heuristic methods are based on the progressive-alignment strategy Feng and Doolittle 1987, Hogeweg and Hesper 1984, Taylor 1988. with ClustalW (Thompson et al., 1994) being the most widely used implementation. The idea is to take an initial, approximate, phylogenetic tree between the sequences and to gradually build up the alignment, following the order in the tree. Although successful in a wide variety of cases, this method suffers from its greediness. Errors made in the first alignments cannot be rectified later as the rest of the sequences are added in. T-Coffee is an attempt to minimize that effect, and although the strategy we propose here is also a greedy progressive method, it allows for much better use of information in the early stages, as we will see below.

The main alternative to progressive alignment is the simultaneous alignment of all the sequences. Two such packages exist (MSA (Lipman et al., 1989) and DCA (Stoye et al., 1997)), based on the Carrilo and Lipman (1988) algorithm, but they remain an extremely CPU and memory-intensive approach. Iterative strategies Gotoh 1996, Notredame and Higgins 1996 are another interesting alternative. They do not provide any guarantees about finding optimal solutions but are reasonably robust and much less sensitive to the number of sequences than their deterministic counterparts.

All of these methods attempt to carry out global alignments, where one tries to align the full lengths of the sequences with each other. Alternatively, one might wish to consider local similarity, as occurs when two proteins share only a domain or motif. For two-sequence comparisons, there is the well-known Smith and Waterman (1981) algorithm. Here we use Lalign (Huang & Miller, 1991), from the FASTA package (Pearson & Lipman, 1988), which is a variant of the Smith and Waterman method. It produces sets of non-overlapping local alignments from the comparison of two sequences. For multiple sequences, the Gibbs sampler (Lawrence et al., 1993) and Dialign2 (Morgenstern, 1999) are the main automatic methods. These programs often perform well when there is a clear block of ungapped alignment shared by all of the sequences. They perform poorly, however, on general sets of test cases when compared with global methods (Thompson et al., 1999b; this work). In principle, a method able to combine the best properties of global and local multiple alignments might be very powerful. This is the second motivation for T-Coffee: the design of a method that provides a simple, flexible and, most importantly, accurate solution to the problem of how to combine information of this sort. Accuracy is tested as overall performance on 141 test case alignments from the BaliBase collection Thompson et al 1999a, Thompson et al 1999b.

Section snippets

T-Coffee algorithm

T-Coffee (Tree-based Consistency Objective Function for alignment Evaluation) has two main features. First, it provides a simple and flexible means of generating multiple alignments, using heterogeneous data sources. The data from these sources are provided to T-Coffee via a library of pair-wise alignments. Here we demonstrate the power of T-Coffee by computing multiple alignments using a library that was generated using a mixture of local and global pair-wise alignments (Figure 1).

The second

Combining local and global alignments without extension

The effect of combining local and global alignments is shown in Table 1. Three alternative primary libraries (i.e. without extension) were used to make the alignments: the ClustalW pair-wise library (C), the Lalign pair-wise library (L), and pooling of the ClustalW and Lalign pair-wise libraries (CL). In each of the five BaliBase categories, the combination of local and global information (CL) induced a statistically meaningful improvement over the two single method-based protocols (Table 1).

Discussion

T-Coffee is a new progressive method for sequence alignment. It can combine signals from heterogeneous sources (e.g. sequence-alignment programs, structure alignments, threading, manual alignment, motifs and specific constraints) into a unique consensus multiple sequence alignment. We show here that a combination of local and global alignments leads to a significant increase in alignment accuracy. The method is more accurate than its counterparts and has proved successful in a wide variety of

Acknowledgements

The authors thank the following people: Philipp Bucher for useful discussions and advice at an early stage of the project, Willie Taylor for useful discussions, Nigel Douglas for providing us with an efficient LINUX environment, and Webb Miller and Bill Pearson for allowing us to use and modify Lalign from the FASTA package. This work was funded in part by grant 3100-49669.96 from the Swiss National Science Foundation.

References (38)

  • H. Carrillo et al.

    The multiple sequence alignment problem in biology

    SIAM J. Appl. Math.

    (1988)
  • M.O. Dayhoff
    (1978)
  • D.-F. Feng et al.

    Progressive sequence alignment as a prerequisite to correct phylogenetic trees

    J. Mol. Evol.

    (1987)
  • M. Gribskov et al.

    Profile analysisdetection of distantly related proteins

    Proc. Natl Acad. Sci. USA

    (1987)
  • D. Haussler et al.

    Proceedings for the 26th Hawaii International Conference on Systems Sciences

    (1993)
  • S. Henikoff et al.

    Amino acid substitution matrices from protein blocks

    Proc. Natl Acad. Sci. USA

    (1992)
  • P. Hogeweg et al.

    The alignment of sets of sequences and the construction of phylogenetic trees. An integrated method

    J. Mol. Evol.

    (1984)
  • L. Holm et al.

    Third International conference on Intelligent Systems for Molecular Biology (ISMB)

    (1995)
  • J.D. Kececioglu

    The maximum weight trace problem in multiple sequence alignment

    Lect. Notes Comput. Sci.

    (1993)
  • Cited by (5797)

    View all citing articles on Scopus
    1

    Edited by J. Thornton

    View full text