Estimating the age of rare disease mutations: the example of Triple-A syndrome
E Genin (1), A Tullio-Pelet (2), F Begeot (2), S Lyonnet (2), L Abel (3)

(1) Génétique Epidémiologique et Structure des Populations Humaines, INSERM U535, Hôpital Paul Brousse, BP 1000, 94 817 Villejuif Cedex, France
(2) Handicaps Génétiques de l’Enfant, INSERM U393, Hôpital Necker-Enfants Malades, 149 rue de Sèvres, 75015 Paris, France
(3) Génétique Humaine des Maladies Infectieuses, Université René Descartes INSERM U550, Faculté de Médecine Necker, 156 rue de Vaugirard, 75015 Paris, France

Correspondence to: Dr L Abel, Laboratoire de Génétique Humaine des Maladies Infectieuses, INSERM U550, Faculté de Médecine Necker, 156 rue de Vaugirard, 75015 Paris, France; abelnecker.fr


Triple-A syndrome (MIM 231550) is an autosomal recessive disorder characterised by adrenocorticotrophin hormone resistant adrenal insufficiency, achalasia of the oesophageal cardia, and alacrima.1 The gene, previously localised to chromosome 12q13,2,3 was recently identified and denoted as AAAS.4 Among the five homozygous truncating mutations that were characterised, a single splice donor site mutation (IVS14+1G→A) was found in several unrelated affected individuals of north African origin, strongly suggesting a founder effect.4 In this work, we set out to estimate the time at which the mutation occurred, since this is expected to provide helpful information on its natural history. Different methods have been proposed to estimate the age of mutations,5,6 based either on allele frequencies7 or on intra-allelic variability and the pattern of linkage disequilibrium at closely linked marker loci,8,9 with extensions to the analysis of multilocus data.9–12 However, these latter approaches are dedicated more to the fine mapping of the mutation, and may not be appropriate for estimating the age of rare mutant alleles that are found in only a few affected individuals. Therefore, we developed a new, simple likelihood based method that uses multilocus marker data to estimate the age of the most recent common ancestor of the mutation from a small number of patients. The method was tested through simulations and then applied to nine consanguineous Triple-A patients who were homozygous for the IVS14+1G→A mutation.

MATERIALS AND METHODS

Method of dating

We will first describe the principle of the method for two affected individuals who carry the studied mutation at the disease locus D, and then extend the method to a sample of N affected individuals. The basic assumption is that the two affected individuals descend from a common ancestor who introduced the mutation ngen generations ago (written n in the formulas that follow). The problem is to estimate ngen from the size of the haplotype shared by the two individuals on each side of D. For simplicity, we will describe the method for only one side, the right side, as the treatment for the other side is completely analogous and independent. Let (M1,...,Mx,...,Mk) be a set of k ordered markers that have been typed on the right side of D and that are located at recombination fractions (θ1,...,θx,...,θk) from D. Mk is the first marker on the right side for which the two individuals have different alleles. The problem can be understood as a survival analysis problem in which the starting point is the disease locus, the discrete time scale is the genetic distance, and the event of interest is the occurrence of a recombination. Therefore, we define:

  • S(x), the probability that no recombination took place during n generations between D and marker Mx: $S(x) = (1-\theta_x)^n$, with $S(0) = 1$ at the disease locus itself

  • f(x), the probability that a recombination took place exactly in the xth interval between marker Mx−1 and Mx: $f(x) = S(x-1) - S(x)$
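The two definitions above can be sketched in a few lines of Python. This is only an illustration: the function names, the example recombination fractions, and the convention S(0) = 1 (no distance, hence no recombination, at D itself) are assumptions made here, not part of the original C program.

```python
from typing import Sequence


def survival(theta_x: float, n: int) -> float:
    """S(x): probability of no recombination between D and M_x over n generations."""
    return (1.0 - theta_x) ** n


def interval_prob(thetas: Sequence[float], x: int, n: int) -> float:
    """f(x) = S(x-1) - S(x) for a 1-based interval index x.

    thetas[x - 1] is the recombination fraction between D and marker M_x.
    S(0) is taken to be 1 (assumption: theta is 0 at the disease locus).
    """
    s_prev = 1.0 if x == 1 else survival(thetas[x - 2], n)
    return s_prev - survival(thetas[x - 1], n)


# Illustrative values only (not taken from the Triple-A data).
thetas = [0.001, 0.002, 0.003]
n = 50
probs = [interval_prob(thetas, x, n) for x in (1, 2, 3)]
no_break = survival(thetas[-1], n)
```

By construction the interval probabilities and the probability of no recombination up to the last marker sum to one, which is a convenient sanity check on any implementation.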

Key points

  • Among the disease causing mutations that were recently characterised in Triple-A syndrome (MIM 231550), the IVS14+1G→A mutation was found in several unrelated affected individuals of north African origin, strongly suggesting a founder effect. Although different methods have been proposed to estimate the age of relatively common mutations, none of these methods has been evaluated in the context of rare monogenic diseases such as Triple-A syndrome.

  • We first developed a simple likelihood based method to estimate the age of the most recent common ancestor of the patients, based on the information provided by haplotypes shared by affected individuals. Through simulation studies, we show that the method provides satisfactory results even with small samples (≈5 individuals), and we investigate how misspecification of allele frequencies and mutation rates at the marker loci influences the age estimates.

  • We applied the method to the analysis of nine patients with Triple-A, who were carriers of the IVS14+1G→A mutation, and found an age estimate of ≈1000–1175 years. Interestingly, this period corresponds to a time of important migrations into north Africa from the Arabian peninsula.

For two individuals who shared all their alleles for markers (M1,...,Mk−1) and have a different allele at marker Mk, two possibilities should be considered:

  • for both individuals, a recombination did occur in the kth interval between markers Mk−1 and Mk, and the likelihood is:

$$L_a(n) = f(k)^2 = [S(k-1) - S(k)]^2 \qquad (1)$$

  • only one of the two individuals has recombined in the kth interval, and the other has received the ancestral allele at Mk (and thus not recombined by the kth interval); in this case the likelihood is:

$$L_b(n) = 2\,f(k)\,S(k) = 2\,[S(k-1) - S(k)]\,S(k) \qquad (2)$$

Therefore, the likelihood for this side is the sum of La(n) and Lb(n), and the final likelihood for two individuals is the product of the two side specific likelihoods.
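As a minimal sketch of the two-individual case, the code below assumes that the two possibilities described above contribute f(k)² (both recombined in the kth interval) and 2·f(k)·S(k) (exactly one recombined there), and it maximises the product of the two side specific likelihoods over a grid of generation numbers. The function names, the grid search, and the recombination fractions are illustrative assumptions, not the authors' implementation.

```python
def S(theta: float, n: int) -> float:
    """Probability of no recombination over n generations at recombination fraction theta."""
    return (1.0 - theta) ** n


def side_likelihood_pair(theta_prev: float, theta_k: float, n: int) -> float:
    """One-side likelihood for two individuals.

    theta_prev: recombination fraction between D and M_{k-1} (last shared marker).
    theta_k:    recombination fraction between D and M_k (first discordant marker).
    """
    f_k = S(theta_prev, n) - S(theta_k, n)
    l_a = f_k ** 2                     # both recombined in interval k
    l_b = 2.0 * f_k * S(theta_k, n)    # exactly one recombined in interval k
    return l_a + l_b


def pair_likelihood(left, right, n: int) -> float:
    """Product of the two side specific likelihoods."""
    return side_likelihood_pair(left[0], left[1], n) * side_likelihood_pair(right[0], right[1], n)


# Hypothetical (theta_{k-1}, theta_k) pairs for the left and right sides.
left = (0.004, 0.006)
right = (0.010, 0.012)
n_hat = max(range(1, 1001), key=lambda n: pair_likelihood(left, right, n))
```

Under these assumptions the one-side likelihood simplifies to S(k−1)² − S(k)², that is, the probability that both haplotypes are intact up to Mk−1 minus the probability that both are intact up to Mk.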

We now consider a sample of N affected individuals who carry the studied mutation, and assume that those N affected individuals descend independently from the same ancestor (star genealogy6). The issue of different lineages occurring at different times will be discussed later. For each side, two groups of individuals should be determined among the N. The first group, denoted as G1, corresponds to the y individuals (2 ≤ y ≤ N) who share the longest haplotype, that is, those y individuals share all their alleles for markers (M1,..., Mk−1) and have a different allele at marker Mk. The second group, denoted as G2, corresponds to the remaining N − y individuals for which the first recombination occurred in an interval xi, i = 1 to (N − y), closer to the mutation (xi < k). For individuals of G1, the likelihood is written using the same principle as that for two individuals. Expressions 1 and 2 become

$$L_a(n) = f(k)^y \qquad (3)$$

and

$$L_b(n) = y\,f(k)^{y-1}\,S(k) \qquad (4)$$

The likelihood contribution of an individual i belonging to G2 who has recombined within interval xi is simply:

$$L_i(n) = f(x_i) = S(x_i - 1) - S(x_i) \qquad (5)$$

The likelihood of n given the data on the studied side is thus:

$$L_{\text{side}}(n) = [L_a(n) + L_b(n)] \prod_{i=1}^{N-y} f(x_i) \qquad (6)$$

and the final likelihood, L(n), is the product of the two side specific likelihoods.
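The N-individual likelihood can be sketched as follows, under the assumption that the G1 term generalises the two-individual case to f(k)^y + y·f(k)^(y−1)·S(k) and that each G2 individual contributes f(xi). The function names, the grid search for the maximum, and the example data are hypothetical and serve only to make the structure of the computation concrete.

```python
from math import prod
from typing import Sequence


def side_likelihood(thetas: Sequence[float], k: int, y: int,
                    g2_intervals: Sequence[int], n: int) -> float:
    """One-side likelihood of n generations.

    thetas[x - 1]: recombination fraction between D and marker M_x.
    k: first marker at which the y individuals of G1 differ.
    g2_intervals: interval x_i (1-based, x_i < k) of the first break for each G2 individual.
    """
    def s(x: int) -> float:
        return 1.0 if x == 0 else (1.0 - thetas[x - 1]) ** n

    def f(x: int) -> float:
        return s(x - 1) - s(x)

    g1 = f(k) ** y + y * f(k) ** (y - 1) * s(k)   # all, or all but one, broke in interval k
    g2 = prod(f(x) for x in g2_intervals)          # empty product is 1 when G2 is empty
    return g1 * g2


def estimate_ngen(left, right, n_max: int = 2000) -> int:
    """Grid-search maximum of the product of the two side specific likelihoods."""
    return max(range(1, n_max + 1),
               key=lambda n: side_likelihood(*left, n) * side_likelihood(*right, n))


# Hypothetical data for five individuals (not the Triple-A haplotypes).
thetas = [0.002, 0.004, 0.006, 0.008]
left = (thetas, 4, 3, [1, 2])      # three share M1..M3; two broke in intervals 1 and 2
right = (thetas, 3, 2, [1, 1, 2])  # two share M1..M2; three broke earlier
n_hat = estimate_ngen(left, right)
```

A grid search is used here only because n is a small positive integer; any one-dimensional optimiser would do.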

In all the previous derivations, we have not accounted for the possibility that a descendant haplotype may differ from its ancestor because of mutations rather than recombinations. Ignoring such mutations attributes every allele change to a crossover, which could lead to an overestimation of the age of the disease causing mutation to an extent that depends on the mutation rates at the markers. Moreover, equation 6 only holds if shared alleles indicate identity by descent, so that the positions of crossovers can be inferred from the observed data without ambiguity. When the polymorphism of markers is low, it becomes necessary to take allele frequencies into account. Extensions of the method that take into account both mutations at the marker loci and marker allele frequencies are presented, with the corresponding formulas, in supplementary material available at http://u535.vjf.inserm.fr/Pagesperso/GENIN/supplement_jmg.htm. The supplementary material also provides the details of the computation of the 95% confidence interval of the estimated generation number. This computation is based on Bayesian principles as proposed in Piccolo et al.13
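The exact interval computation is given only in the supplementary material and is not reproduced here. As a generic illustration of a Bayesian interval for a discrete parameter, the sketch below normalises a likelihood over a grid of generation numbers under a flat prior and reads off the 2.5% and 97.5% quantiles. The flat prior, the grid bound n_max, and the function name are assumptions and may differ from the procedure of Piccolo et al.

```python
from typing import Callable, Tuple


def credible_interval(likelihood: Callable[[int], float], n_max: int,
                      level: float = 0.95) -> Tuple[int, int]:
    """Equal-tailed interval for n in 1..n_max under a flat prior (assumption)."""
    grid = range(1, n_max + 1)
    weights = [likelihood(n) for n in grid]
    total = sum(weights)
    tail = (1.0 - level) / 2.0
    lower = upper = None
    cumulative = 0.0
    for n, w in zip(grid, weights):
        cumulative += w / total          # posterior mass up to and including n
        if lower is None and cumulative >= tail:
            lower = n
        if upper is None and cumulative >= 1.0 - tail:
            upper = n
    if upper is None:                    # guard against floating point rounding
        upper = n_max
    return lower, upper
```

In practice the `likelihood` argument would be the product of the two side specific likelihoods, and n_max should be large enough that the likelihood is negligible beyond it.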

Simulated data

Simulations were conducted to investigate the proposed method. To generate individual haplotype data, we considered a mutation at a disease locus D and a set of equally spaced markers on each side of D. For sample sizes larger than 10 individuals, two different recombination fractions between markers were considered: θ = 0.01 and θ = 0.002. For smaller sample sizes, to ensure that individuals shared at least one marker on each side of the mutation, distances between markers were reduced and the recombination fraction was 0.0005. Markers were assumed to have different numbers of alleles with equal frequencies, which were varied from 0.1 to 0.3. Given a simulated ancestral haplotype and a number of generations, Monte Carlo methods were used to generate the positions of crossovers and marker mutations. The marker mutation rate, μ, was increased up to $10^{-3}$, since mutation rates of this magnitude have been observed for some short tandem repeat loci in human beings.14 This process was repeated until the desired number of individuals was reached. The simulated data were then analysed with the method described above. All computations were performed using a C program we developed, which is available upon request (email E Genin at geninvjf.inserm.fr).
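The crossover part of such a simulation might look like the sketch below. It is a simplification written for illustration, not the authors' C program: marker mutations and allele frequencies are ignored, intervals between adjacent markers are treated as independent, and all names and parameter values are hypothetical.

```python
import random
from typing import List, Tuple


def first_break(n_markers: int, theta: float, ngen: int, rng: random.Random) -> int:
    """1-based index of the first interval, moving outward from D, hit by a crossover
    in any of ngen generations; n_markers + 1 if the whole side stays intact."""
    p_break = 1.0 - (1.0 - theta) ** ngen   # at least one crossover in this interval
    for x in range(1, n_markers + 1):
        if rng.random() < p_break:
            return x
    return n_markers + 1


def simulate_sample(n_ind: int, n_markers: int, theta: float, ngen: int,
                    seed: int = 0) -> List[Tuple[int, int]]:
    """(left break, right break) for each of n_ind independent descendants (star genealogy)."""
    rng = random.Random(seed)
    return [(first_break(n_markers, theta, ngen, rng),
             first_break(n_markers, theta, ngen, rng))
            for _ in range(n_ind)]


sample = simulate_sample(n_ind=10, n_markers=6, theta=0.002, ngen=50, seed=1)
```

Each simulated break position can then be converted into the groups G1 and G2 used by the likelihood: G1 contains the individuals with the most distant break on a given side.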

Different simulation conditions were considered according to the number of individuals per sample (from 2 to 30), the number of generations (denoted as ngen and ranging from 10 to 100), and the marker characteristics (recombination fraction θ, allele frequency p, mutation rate). For each simulation condition, 5000 replicates were generated. The results present the mean number of generations (with empirical 95% confidence interval) estimated over the 5000 replicates. Because of the nature of the problem, the distribution of these estimates is highly skewed to the right, especially for low numbers of individuals and low numbers of generations. For this reason, we also report the median of the estimates and a coverage measure, pout-ci, defined as the proportion of replicates in which the true value of ngen falls outside the 95% confidence interval of the estimate.
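The summary statistics described above can be computed as in the following sketch. The function name, the dictionary keys, and the use of the 2.5% and 97.5% empirical quantiles for the empirical interval are assumptions made for illustration; the toy replicates are invented.

```python
from statistics import mean, median, quantiles
from typing import Dict, Sequence, Tuple


def summarise(estimates: Sequence[float],
              intervals: Sequence[Tuple[float, float]],
              true_ngen: float) -> Dict[str, object]:
    """Mean, median, empirical 95% interval, and proportion of replicates whose
    95% confidence interval misses the true number of generations (pout-ci)."""
    cuts = quantiles(estimates, n=40, method="inclusive")  # cut points every 2.5%
    misses = sum(not (lo <= true_ngen <= hi) for lo, hi in intervals)
    return {
        "mean": mean(estimates),
        "median": median(estimates),
        "empirical_ci": (cuts[0], cuts[-1]),
        "p_out_ci": misses / len(intervals),
    }


# Toy replicates with one extreme estimate, mimicking the right skew described above.
estimates = [9, 10, 10, 11, 60]
intervals = [(5, 15), (6, 20), (11, 30), (4, 9), (8, 12)]
summary = summarise(estimates, intervals, true_ngen=10)
```

In this toy example a single extreme replicate pulls the mean well above the median, which is why the median is reported alongside the mean.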

Triple-A data

In the present study, we investigated nine consanguineous patients with clinical features defining Triple-A, who were homozygous for the IVS14+1G→A mutation. All patients originated from north Africa (mainly from Algeria and Tunisia) and had no known familial relationships. To estimate the age of the founder Triple-A mutation, we selected the polymorphic markers encompassing the AAAS locus according to the genetic and physical maps available at http://genome.ucsc.edu and www.ncbi.nlm.nih.gov. As the genetic distances available for closely linked markers are generally not very accurate, rates of recombination between markers were computed using both the overall genetic length of the haplotype and the physical distances between markers as proposed in Picard et al.15 The closest markers at which affected individuals do not share any alleles were determined on both sides of the AAAS locus. Genotyping was performed on DNA extracted from blood according to standard procedures as described in Hadj-Rabia et al.3 The frequency of shared alleles was estimated from a sample of 30 unrelated subjects of north African origin.

RESULTS

Simulation study

The top of table 1 presents results for a very small number of individuals (2 and 5) in a situation of complete identity by descent information and for a recombination fraction of 0.0005. Although the mean number of generations (ngen) is overestimated for two individuals, this is due to a small number of replicates that led to very high estimates of ngen. In this situation, the median provides a more reliable summary of the estimates, and is much closer to the true value of ngen than the mean. When five individuals are used, the estimates become quite satisfactory for the median value as well as for pout-ci. As the sample size increases (bottom of table 1), estimates become very good and confidence intervals narrow. Similar results were obtained for a recombination fraction of 0.002 between markers (data not shown), indicating that the recombination fraction has no substantial influence on the estimates when information on identity by descent is complete.

Table 1

Mean and median estimates over 5000 replicates of the number of generations for different true values, ngen, and different sample sizes

Table 2 presents the impact of misspecifying mutation rates in samples of 10 individuals for two recombination fractions (θ = 0.01 and θ = 0.002), assuming complete information on identity by descent. Three different mutation rates (0, $10^{-6}$, and $10^{-3}$) at the markers were used both to simulate (μsim) and to analyse (μana) the data. Table 2 presents the results when all allele changes are equally likely by mutation, but the very same pattern of results (data not shown) is obtained when a stepwise mutation model is used (that is, a mutation can only modify the number of repeats by one). When the mutation rate is correctly specified (μsim = μana, values in bold), the estimates of the numbers of generations are quite reliable and the coverage pout-ci is very close to 5%. When mutations are introduced in the simulations but ignored in the analysis (μana = 0), the age of the disease mutation is overestimated, as expected. Conversely, if the mutation rates specified in the analysis are larger than those used for the simulations, the age is underestimated. This overestimation (or underestimation) becomes substantial when μsim = $10^{-3}$ and μana = 0 (or μsim = 0 and μana = $10^{-3}$). Moreover, the bias is much more problematic with a tight map of markers (θ = 0.002), with an overestimation (underestimation) of ngen of around 50% (30%) and pout-ci values of around 40%.

Table 2

Influence of misspecifying mutation rates on the estimates of the number of generations according to the true number of generations (ngen) and the recombination fraction between markers (θ = 0.01 or 0.002)

Table 3 presents the impact of misspecifying allele frequencies for two different recombination fractions (θ = 0.01 and θ = 0.002). Three different true allele frequencies at the markers (0.1, 0.2, and 0.3) are considered and analysed under different assumed frequencies (0.01, 0.1, 0.2, and 0.3). When data are analysed with the correct allele frequencies (bold values of table 3), the proposed correction provides very consistent results except in the situation combining a large number of generations (ngen = 100), a true allele frequency of 0.30, and θ = 0.01. This phenomenon is amplified when data are analysed with misspecified allele frequencies. While the results remain quite satisfactory with ngen = 10, allele frequencies have a stronger effect when ngen = 100. In this latter case, the number of generations can be severely underestimated (or overestimated) when the allele frequency used for the analysis is underestimated (or overestimated) and θ = 0.01. However, when θ between markers is reduced to 0.002, the misspecification of allele frequencies has much less influence on the estimates, arguing in favour of the use of tightly linked markers. The same observation is made in smaller samples (2 or 5 affected individuals), in which the use of low recombination fractions (θ = 0.0005) between markers makes the analysis almost independent of allele frequencies (data not shown).

Table 3

Influence of misspecifying marker allele frequencies on the estimates of the number of generations according to the true number of generations (ngen) and the recombination fraction between markers (θ)

Triple-A study

Twelve markers were genotyped around the AAAS gene, six markers on each side, and the haplotypes of the nine patients at these markers are presented in table 4. For each of these patients, only one of the two haplotypes carrying the mutation was considered since the disorder is recessive and all families were inbred. The shared region is in bold italic in table 4 and all patients were homozygous in this region. Since the genetic and physical distances between D12S85 and D12S1702, the two extreme markers of the haplotype, are 15.9 cM (Genethon map16), and 22.33 Mb, respectively, as calculated by BLAST human genome searches, the correspondence between genetic and physical distances over the whole region was estimated to be 0.712 cM for 1 Mb. Recombination fractions between the different markers and AAAS were then computed from genetic distances using the Kosambi mapping function, and are shown in table 4.
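The conversion described above can be reproduced with a short calculation. The sketch assumes the standard Kosambi mapping function, θ = ½·tanh(2d) with d in Morgans; the function name and the 1 Mb example distance are hypothetical, since the marker positions of table 4 are not reproduced here.

```python
from math import tanh

GENETIC_CM = 15.9     # D12S85 to D12S1702, Genethon map
PHYSICAL_MB = 22.33   # same interval, physical distance
CM_PER_MB = GENETIC_CM / PHYSICAL_MB   # about 0.712 cM per Mb


def kosambi_theta(distance_mb: float) -> float:
    """Recombination fraction for a marker located distance_mb from AAAS."""
    d_morgans = distance_mb * CM_PER_MB / 100.0   # Mb -> cM -> Morgans
    return 0.5 * tanh(2.0 * d_morgans)


theta_1mb = kosambi_theta(1.0)   # hypothetical marker 1 Mb from the mutation
```

At the short distances involved here the mapping function is nearly linear, so the recombination fraction is very close to the genetic distance expressed in Morgans.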

Table 4

Haplotypes encompassing AAAS in nine unrelated patients (P1–P9) sorted according to the length of sharing on the centromeric (left in table) side of the mutation

The age of the most recent common ancestor of the IVS14+1G→A mutation was estimated to be 47 generations (95% confidence interval, 28 to 80) when the marker mutation rate was fixed at 0. Assuming that one generation is 25 years, this corresponds to 1175 years (95% confidence interval, 700 to 2000 years). When marker mutations were considered in the analysis, age estimates were reduced slightly to 46 generations (95% confidence interval, 28 to 78) when μ = $10^{-4}$, and to 39 generations (95% confidence interval, 24 to 67) when μ = $10^{-3}$.
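The conversion from generations to years is simple arithmetic, shown here for the three mutation rates under the stated assumption of 25 years per generation. The variable names are illustrative.

```python
YEARS_PER_GENERATION = 25

# Marker mutation rate -> (estimate, lower bound, upper bound) in generations.
estimates_generations = {
    0.0: (47, 28, 80),
    1e-4: (46, 28, 78),
    1e-3: (39, 24, 67),
}

estimates_years = {
    mu: tuple(g * YEARS_PER_GENERATION for g in values)
    for mu, values in estimates_generations.items()
}
```

With μ = 0 this gives 1175 years (700 to 2000), and with μ = 10⁻³ it gives 975 years (600 to 1675), so the point estimates span roughly 1000 to 1175 years across the mutation rates considered.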

To investigate whether this result can be affected if some subsets of patients are more related, we estimated the age of the mutation on subsamples of eight patients, removing each individual one after the other (assuming μ = 0). These estimates varied from 41 (when individual P8 or P9 was excluded) to 88 generations (when individual P2 was excluded), and the 95% confidence interval of these estimates always included the initial estimate of 47 generations. Although this is not conclusive evidence against the existence of more related subsets of patients in our sample, this result indicates that this eventuality would have a small impact on the estimate. Moreover, no significant correlation (p>0.4) was observed in the length of sharing between the two sides of the mutation as illustrated in table 4, where patient haplotypes have been sorted according to the length of sharing on the centromeric side of the disease causing mutation.
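The leave-one-out check described above can be sketched generically. The helper name and the placeholder estimator are hypothetical; in practice `estimate` would be the likelihood based age estimator applied to the eight remaining haplotypes.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def leave_one_out(sample: Sequence[T],
                  estimate: Callable[[List[T]], float]) -> List[float]:
    """Re-estimate after removing each individual in turn."""
    return [
        estimate([s for j, s in enumerate(sample) if j != i])
        for i in range(len(sample))
    ]


# Placeholder estimator (a plain average) used only to show the call pattern.
toy = leave_one_out([1.0, 2.0, 3.0], lambda xs: sum(xs) / len(xs))
```

A large change in the estimate when one individual is removed would suggest that this individual is atypical, for example because it is more closely related to some of the other patients.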

DISCUSSION

We have presented a simple likelihood based haplotype approach that estimates the age of the most recent common ancestor carrying the disease causing mutation, and not the age of the mutation itself, which would require assumptions about population genetic processes that are usually difficult to validate. As compared with other methods using haplotype information, our approach presents the advantage of being efficient with a very small number of affected individuals, and is thus well suited to estimating the age of rare mutations. As expected, the method is particularly efficient when markers are very polymorphic and provides good estimates as long as shared alleles have frequencies less than 0.20 and these frequencies are correctly specified. An important result was the finding that the influence of allele frequencies on the results strongly decreases as marker distance decreases, suggesting that a helpful way to overcome the allele frequency issue is to refine the map surrounding the mutation. Another haplotype method that differs from ours in the way marker mutations are introduced and in the determination of the most plausible ancestral haplotype was also proposed to estimate the age of mutations in breast cancer genes.17,18 However, no simulations have been performed to investigate the reliability of this method in estimating the age of mutations and, in particular, rare mutations.

The method was used to date the IVS14+1G→A mutation in Triple-A syndrome using data from nine patients of north African origin. We found that the most recent common ancestor of these patients probably lived ≈1000–1175 years ago (95% confidence interval, 600 to 2000), a period of important migrations into north Africa from the Arabian peninsula.19 In this context, it is worth recalling that one of the first patients described with clinical features of Triple-A was from Saudi Arabia.20 This analysis was performed assuming that all haplotypes observed in the sample diverged independently from a common ancestral haplotype (star genealogy) that carried the mutation. However, other kinds of intra-allelic genealogy are possible, with subsets of patients sharing a more recent common ancestry.6 When prior information on the existence of subsets in the data is available (for example, different ethnic groups), an analysis stratified by subsets can be conducted. When no prior information is available, subgroups might be identified from an excess of haplotype sharing, and we are exploring such strategies. One example is to look for a correlation in the length of sharing between the two sides of the mutation; in the present Triple-A data, this did not provide evidence for more related subsets of patients.

Footnotes

  • This work was supported by Fondation BNP-Paribas, Fondation pour la Recherche Médicale, Fondation Schlumberger, and Délégation à la Recherche Clinique (Grant CRC00114).

  • Conflicts of interest: none declared.
