Introduction

RET is the gene symbol for the ret proto-oncogene located near the centromere of chromosome 10 in band q11.2. The protein encoded by RET is the common component of the receptors for the GDNF family of neurotrophic factors.1,2,3 Gain of function mutations in RET result in dominant oncogenic conversion causing MEN2 (types A and B).4,5 Yet, mutations in RET resulting in loss of biological activity are associated with Hirschsprung disease.6,7 The early linkage studies also showed that RET was within a region of markedly reduced recombination.8 The entire RET genomic sequence has been cloned in a contig of cosmids encompassing 150 kilobases (kb), from the STRP sTCL-2 to the region upstream of the RET promoter.9 The entire genomic sequence of the region is now available (see GenBank accessions AJ243297, AC010864, and AL591116).

Genetic studies of complex disorders have been shown to be more powerful if linkage disequilibrium (LD) exists between susceptibility alleles and normal genetic markers.10 However, little is known about the magnitude and extent of LD in humans except that it varies among loci studied and among populations (eg see Tishkoff et al,11 Kidd KK et al,12 Kidd JR et al,13 Stephens et al,14 Reich et al,15 and Osier et al16). LD can be evaluated by a variety of statistics17,18 that are functions of the haplotype frequencies in the population. Thus, the starting point for studies of LD is determining haplotype frequencies. The expectation–maximization (EM) algorithm gives accurate estimates of the frequencies of common haplotypes19,20,21 especially when there is significant disequilibrium. Those haplotype frequencies can also provide information on evolutionary histories, beyond what can be learned from individual markers.16,22 We are studying RET both in order to understand the evolutionary histories of normal allelic variation at this locus and as an example of a centromeric locus in a region of known reduced recombination. In this study, we examine the haplotype frequencies and LD relations of six RET polymorphisms (Figure 1) in 32 populations distributed around the world. These data reinforce the growing consensus that African populations have significantly less LD than non-African populations.12,14,22,23

Figure 1

Physical map of the RET locus and positions of the polymorphisms studied. The ALFRED UIDs for these sites and haplotypes are as follows: intron 1 G/C SNP: SI000753P; exon 2 HaeIII: SI000194O; intron 5 (CA)n: SI000767U; exon 13 TaqI: SI000160H; exon 15 RsaI: SI000161I; intron 19 TaqI: SI000695U; five-SNP haplotype: SI000861P; six-SNP haplotype: SI000860O.

Materials and methods

Populations sampled

The 32 populations we have typed for six markers at RET are listed by geographic region in Table 1. Sample sizes range from 23 individuals (Nasioi) to 118 individuals (Irish) with a mean sample size of about 53 individuals. Descriptive information and literature citations for these population samples can be found in the allele frequency database ALFRED (http://alfred.med.yale.edu/) under the UIDs in Table 1.

Table 1 Frequencies for the nine most frequent haplotypes

All samples were collected with informed consent from the participants and approval from the appropriate institutional review boards. The DNA in this study was purified by means of standard phenol–chloroform extraction and ethanol precipitation24 from Epstein–Barr virus–transformed lymphoblastoid cell lines25 for all samples.

Polymorphic sites and typing protocols

The six RET markers typed for this study include five biallelic single nucleotide polymorphisms (SNPs) – intron 1 G/C SNP (this study), exon 2 HaeIII,26 exon 13 TaqI,27 exon 15 RsaI,28 and intron 19 TaqI (this study) – and a 13-allele (CA)n short tandem repeat polymorphism (STRP)29 within intron 5, altogether spanning 41.6 kb as shown in Figure 1. All typings were PCR-based, using the primers and protocols described in ALFRED and at the Kidd Lab Web Site.

The three coding region SNPs are synonymous changes described previously. The intron 1 G/C SNP at nucleotide position 8019 of intron 1 was identified by resequencing in our laboratory and was typed on all samples by fluorescence polarization.30 The SNP within intron 19 is a G/A SNP at position 765 of the intron and alters a TaqI restriction site. This SNP was noted in in silico analysis of GenBank sequences and validated in African-American and European-American samples. For the four restriction site markers, the PCR product was digested with the appropriate enzyme, according to the manufacturers' protocols, and the fragments were electrophoresed on agarose gels and stained with ethidium bromide. For the intron 5 (CA)n STRP, the amplification products were run on a 5% polyacrylamide gel on an ABI 377 DNA sequencer. Fragment sizes were determined by the Genescan and Genotyper software. All typing results were entered, as individual phenotypes, into PhenoDB2, our client–server database system for genetic marker data.31

Determining ancestral alleles

The primate ancestral alleles for the three coding SNPs were determined by comparing the homologous sequences obtained from samples of other apes using the logic in Iyengar et al.32 The PCR primers used for typing the human polymorphisms were used to generate template for sequencing from genomic DNA from one chimpanzee, one gorilla, and one orangutan for each site. For the fourth site, intron 19 TaqI site, PCR products of two gorillas and two orangutans were tested by digestion with TaqI. Two chimpanzee samples did not amplify with the same primers. The ancestral allele for the intron 1 G/C SNP has not yet been determined.

Statistical methods

Allele frequencies at the individual sites were calculated by gene counting. The assumption of Hardy–Weinberg ratios was tested for the separate sites by means of an auxiliary program, FENGEN.13 Variation in allele and haplotype frequencies across populations was measured as Fst, estimated as σ_p²/(p̄q̄) for each biallelic site and as the weighted average of the standardized variance for each allele for the STRP.33 For each site and the haplotype, expected heterozygosities were computed as 1 − Σp_i², where the p_i are the individual allele frequencies. The multisite haplotype frequency estimates were calculated with HAPLO,34 which implements the EM algorithm. The HAPLO program also calculates two kinds of standard error estimates, jack-knife and binomial. Using the haplotype frequency estimates, pairwise LD coefficients were computed both as the conventional pairwise D′ values35 and as Δ².17 The HAPLO/P18 and PERMSTAT programs were used to compute the overall LD values in the form of ξ coefficients for the five-SNP and the six-site haplotypes. The permutation-based calculations also provide a test of whether overall disequilibrium is statistically significant.18 The HAPLO/P program was also used for the group test of whether the STRP showed significant LD against the 32 background haplotypes defined by the five SNPs, as discussed elsewhere.18,36,37
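As an illustration of the two summary statistics defined above, the following is a minimal sketch rather than the code used in the study. It assumes an unweighted mean and an unweighted population variance of the allele frequency across populations; the function names are hypothetical and sample-size weighting is ignored.

```python
from statistics import fmean, pvariance


def fst_biallelic(p_by_population):
    """Fst for one biallelic site: variance of p across populations / (p_bar * q_bar).

    Assumption: populations are weighted equally and the variance is the
    population variance (divisor n), which may differ from the study's estimator.
    """
    p_bar = fmean(p_by_population)
    q_bar = 1.0 - p_bar
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("site is monomorphic across all populations")
    return pvariance(p_by_population) / (p_bar * q_bar)


def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for any number of alleles."""
    return 1.0 - sum(p * p for p in allele_freqs)
```

Called with one allele frequency per population, `fst_biallelic` returns a value between 0 (identical frequencies everywhere) and 1 (alternate alleles fixed in different populations). `expected_heterozygosity` applies equally to a biallelic SNP, the 13-allele STRP, or a set of haplotype frequencies.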

Results

Primate ancestral alleles

Based on identical sequences for the three non-human primates at the three coding SNP sites, the site-absent or ‘noncutting’ alleles, coded as ‘1’, are ancestral at the exon 15 RsaI (CTAC → GTAC) and the exon 13 TaqI (GCGA → TCGA) sites, while the site-present or ‘cutting’ allele, coded as ‘2’, is ancestral for the exon 2 HaeIII (GGCC → AGCC) site. At the intron 19 TaqI site, the site-present or cutting allele is also ancestral, based on the presence of the restriction site in gorillas and orangutans. The primate sequences for the three coding SNPs have been submitted to GenBank (AF520976-78, AF520980-82, and AF520984-86).

Allele frequencies at individual sites

Marker typings for the six sites have been collected on a total of 1704 individuals. Typing was more than 98% complete across all markers and populations. None of the data deviated significantly from the Hardy–Weinberg expectation. Frequencies and standard errors of all six polymorphisms are given in ALFRED. Allele frequencies for the five biallelic sites are graphed in Figure 2. While frequencies of alleles at the SNPs generally do not differ much among populations within the same geographic region, there are occasional exceptions obvious in Figure 2. Allele frequency variation globally is highly significant at all five sites. With the following few exceptions, all five SNPs are polymorphic in all 32 populations. In the samples of Yoruba and Ibo, only the G allele (site present) occurs at the exon 2 HaeIII site. In the Atayal and Cambodian samples, only the C allele (site absent) is found at the exon 15 RsaI site.

Figure 2

Allele frequencies for RET SNPs. Allele frequencies are plotted for the G allele of intron 1 G/C SNP and allele 1 (site absent) of the four other SNPs in 32 populations. Populations are ordered and grouped as in Table 1.

At the intron 5 STRP, 13 different alleles have been seen globally with sizes ranging from 97 to 121 bp in a perfect 2 bp ladder. The frequencies are all in ALFRED. Sequencing of selected homozygotes indicates that this range corresponds to 13–25 tandem repeats of the CA dinucleotide. The numbers and size ranges of alleles are greater in African populations than anywhere else. Only five of the 13 different alleles are globally common: 107, 109, 111, 113, and 115. Together these account for between 70% of all chromosomes (in the Mexican Pima) and 95% of all chromosomes (in the Karitiana), with an average around the world of more than 80%. Two of these are always ‘common’: 109 with a mean frequency of 0.313 and a range of 0.138–0.745, and 111 with a mean frequency of 0.298 and a range of 0.123–0.585. The other three are very rare to absent in at least one population.
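The correspondence between fragment size and repeat number stated above can be written out as a small worked example. This is only a sketch: the function name is hypothetical, and it assumes that the 2 bp ladder maps one-to-one onto CA repeat count across the whole 97–121 bp range, as the sequencing of homozygotes suggests.

```python
def ca_repeats_from_size(size_bp, min_size_bp=97, min_repeats=13, unit_bp=2):
    """Convert a fragment size on the 2 bp ladder to a CA repeat count.

    Assumption: 97 bp corresponds to 13 repeats and every additional 2 bp
    adds exactly one CA unit.
    """
    offset = size_bp - min_size_bp
    if offset < 0 or offset % unit_bp != 0:
        raise ValueError("size is not on the expected 2 bp ladder")
    return min_repeats + offset // unit_bp


# All sizes on the ladder from 97 to 121 bp inclusive.
ladder_sizes = list(range(97, 122, 2))
```

The ladder contains 13 sizes, matching the 13 alleles observed, and the two most common alleles, 109 and 111, correspond to 19 and 20 repeats under this mapping.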

The Fst values for these polymorphisms, based on these 32 populations, are 0.1 for intron 1 G/C SNP, 0.14 for exon 2 HaeIII, 0.13 for exon 13 TaqI, 0.16 for exon 15 RsaI, 0.18 for intron 19 TaqI, and 0.06 for intron 5 (CA)n STRP. The biallelic Fst values are close to the mean value of about 0.14 that we have obtained for the distribution of Fst values of more than 100 SNPs that we have studied on the same populations.38 The Fst for the STRP site is close to the mean value of 0.07 reported by Calafell et al39 in a series of 45 STRPs typed on 10 populations that are a subset of the 32 population samples reported here and similar to the mean in the larger study by Rosenberg et al.40

Haplotype frequency distributions

The maximum-likelihood estimates of the frequencies of the 32 possible five-SNP haplotypes are given in ALFRED for the 32 populations studied. Of the 32 possible haplotypes, nine occur at a frequency above 10% in at least one population (Table 1). Only one haplotype is present in every population sampled: C2212. Haplotype G2212 is common in populations in all regions except the South Americans, while haplotype G1111 is common in populations in all regions other than African. Haplotype heterozygosity is greater than 0.5 for all populations except one South American group and generally greater than 0.7 (Table 1).
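To make the idea of maximum-likelihood haplotype estimation concrete, here is a minimal sketch of the EM algorithm for the simplest case of two biallelic sites. It is not the HAPLO program and does not handle five or six sites or missing data; the function name and input layout are hypothetical. Genotypes are coded as the number of copies of allele 2 at each site, so only double heterozygotes, coded (1, 1), have ambiguous phase.

```python
def em_two_site_haplotypes(genotype_counts, n_iter=200, tol=1e-10):
    """Estimate two-site haplotype frequencies from unphased genotype counts.

    genotype_counts maps (i, j) to a number of individuals, where i and j are
    the copies (0, 1, 2) of allele 2 at site A and site B.
    Returns {(a, b): frequency} with a, b = 0 for allele 1 and 1 for allele 2.
    """
    known = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    n_double_het = 0
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
    for (i, j), n in genotype_counts.items():
        if i == 1 and j == 1:
            n_double_het += n  # phase unknown, resolved in the E-step
            continue
        # At least one site is homozygous, so both haplotypes are determined.
        for a, b in zip(alleles[i], alleles[j]):
            known[(a, b)] += n
    total = 2 * sum(genotype_counts.values())
    freqs = {h: 0.25 for h in known}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is 11/22 rather than 12/21.
        cis = freqs[(0, 0)] * freqs[(1, 1)]
        trans = freqs[(0, 1)] * freqs[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: recount haplotypes using the expected phase assignments.
        counts = dict(known)
        counts[(0, 0)] += w * n_double_het
        counts[(1, 1)] += w * n_double_het
        counts[(0, 1)] += (1 - w) * n_double_het
        counts[(1, 0)] += (1 - w) * n_double_het
        new = {h: c / total for h, c in counts.items()}
        change = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if change < tol:
            break
    return freqs
```

When the phase-known individuals carry only two complementary haplotypes, the double heterozygotes are assigned almost entirely to that pair, which is one way to see why the estimates are more reliable when disequilibrium is strong.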

Regional averages of the five-SNP haplotype frequencies are graphed in Figure 3. Although there is frequency variation among populations within each region, especially in Africa, these averages make the regional trends more obvious than the data in Table 1. Including the STRP increases the number of possible haplotypes to 416 of which 175 have a nonzero estimate in at least one population. In most of the world, the most common STRP allele, 109, occurs on the two most common SNP haplotypes, C2212 and G2212. Heterozygosity of the six-site haplotype is generally significantly greater for each population than for the five-SNP haplotype, reflecting multiple STRP alleles on many of the SNP-defined haplotypes.
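The haplotype counts quoted above follow directly from the number of alleles per site. The sketch below enumerates them; it assumes that haplotype labels such as C2212 list the intron 1 allele (C or G) first, followed by the 1/2 codes of the four restriction-site SNPs in map order, which is an inference from the labels used in the text rather than a documented convention.

```python
from itertools import product

INTRON1_ALLELES = ("C", "G")   # intron 1 G/C SNP
SITE_CODES = ("1", "2")        # site absent / site present
N_STRP_ALLELES = 13            # alleles observed at the intron 5 (CA)n STRP


def five_snp_haplotypes():
    """Enumerate every possible five-SNP haplotype label, e.g. 'C2212'."""
    return [
        first + "".join(rest)
        for first in INTRON1_ALLELES
        for rest in product(SITE_CODES, repeat=4)
    ]


snp_haplotypes = five_snp_haplotypes()
n_six_site_haplotypes = len(snp_haplotypes) * N_STRP_ALLELES
```

Two alleles at each of five sites give 32 SNP haplotypes, and combining each with any of the 13 STRP alleles gives the 416 possible six-site haplotypes.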

Figure 3

Regional averages of frequencies of common five-SNP haplotypes. Values within parentheses represent number of populations averaged in the geographical region. Data from Table 1.

Linkage disequilibrium

The regional pattern of LD is clear for the five-SNP haplotypes: overall nonrandomness, as quantified by ξ, is relatively low and generally nonsignificant in African populations while relatively higher and generally highly significant (P≤0.001) in populations from the rest of the world (Figure 4). However, there are individual populations that do not clearly fit the pattern. Two of the African population samples, Hausa and Ethiopian Jews, show borderline significance (0.05>P>0.01). Among those population samples significant at P<0.001, the ξ-value quantifies the nonrandomness and shows considerable variation.

Figure 4

LD measures. ξ-values are plotted for overall nonrandomness for the five-SNP RET haplotypes (triangles) and for the association of STRP alleles with the five-site haplotypes (squares). Solid triangles indicate P≤0.001; open triangles indicate P>0.02 except Hausa (P=0.002) and Ibo (P=0.004). Solid squares indicate P<0.01; open squares indicate P>0.1 except Russian (P=0.016), Japanese (P=0.06), Ami (P=0.03), and Rondonian Surui (P=0.04).

The overall LD pattern for the six-site haplotypes is similar to the five-SNP LD pattern but is at a considerably higher level and the statistical significance levels are stronger (data not shown). Figure 4 also shows the association of the STRP with the five-site backbone haplotypes. Except in sub-Saharan African populations, the association is generally significant at P<0.01. Four populations showed marginally significant associations at 0.1>P>0.01: Russians, Japanese, Ami, and R. Surui.

Pairwise LD values, as Δ², are given in Table 2. While there is variability within each geographical region for the pairwise LD results, the overall trend across the 32 populations is remarkably similar for all four independent chromosomal segments (in boldface) defined by the five SNPs, although the intervals differ in size (14.7, 17.8, 1.8, and 7.3 kb). These pairwise disequilibrium values are on average smallest in Africans, somewhat larger in Europeans and Eastern Asians, and largest for Native Americans. Interestingly, however, the values for non-African populations are consistently higher for the two longest of the four regions except for the Native American populations. The two smaller regions show the greatest similarity in LD values across populations (Table 3). With two exceptions, the Δ²-values in Table 2 that are >0.21 have permutation-based P-values <0.010, while all 48 LD values ≥0.5 have P-values <0.001. The two exceptions involve the Nasioi, where LD values of 0.3–0.4 were not significant at the 1% level. For the remaining 87 Δ²-values in the range between 0.21 and 0.50, all have P-values <0.010 while 64% have P-values <0.001. For the Δ²-values ≤0.21, 30% are significant at the 1% level and a third of these have P-values <0.001.
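For readers who want to reproduce pairwise values of this kind from two-site haplotype frequencies, the following is a minimal sketch and not the software used in the study. It assumes that Δ² is the squared allelic correlation (often written r²) and that D′ is D scaled by its maximum possible magnitude given the allele frequencies; the function name is hypothetical.

```python
def pairwise_ld(f11, f12, f21, f22):
    """Return (D, D_prime, delta_squared) from the four two-site haplotype frequencies.

    fXY is the frequency of the haplotype carrying allele X at site A and
    allele Y at site B. D_prime keeps the sign of D. If either site is
    monomorphic the coefficients are not calculable and None is returned.
    """
    p_a = f11 + f12          # allele 1 frequency at site A
    p_b = f11 + f21          # allele 1 frequency at site B
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    d = f11 * f22 - f12 * f21
    denom = p_a * q_a * p_b * q_b
    if denom <= 0.0:
        return d, None, None
    d_max = min(p_a * q_b, q_a * p_b) if d >= 0 else min(p_a * p_b, q_a * q_b)
    return d, d / d_max, d * d / denom
```

The `None` return covers populations in which one of the two sites is fixed, for example the exon 2 HaeIII site in the Yoruba and Ibo samples, where no pairwise coefficient involving that site can be computed.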

Table 2 Pairwise LD as Δ² for all pairs of SNPs
Table 3 Correlation coefficients of the pairwise Δ²-values across populations

Discussion

Frequency estimates and variation

Extensive genetic variation is shown at all of the six polymorphisms that we have thus far studied at the RET locus. The average heterozygosities for the five SNPs are highest in the European samples. This is not surprising since three of the five SNPs were originally identified as RFLPs in samples of individuals of European ancestry. This ascertainment bias is undoubtedly the explanation for that aspect of the global patterns. The heterozygosities are generally lower in the African populations with considerable variation among the populations and SNPs. Outside of Africa, the heterozygosities of the exon 2 HaeIII and intron 19 TaqI sites are highly correlated and generally high (data not shown) as apparent graphically in Figure 2. The correlations are themselves indicators of LD between those two sites. In contrast, the exon 15 RsaI site shows a different pattern with lower heterozygosities in eastern Asian populations and higher heterozygosities in North American Indian populations. Although it is not a striking difference, the heterozygosity for the STRP is higher in Africans than elsewhere. Again, this is not surprising and reflects the well-known higher levels of variation in African populations.41,42 As a result, the heterozygosities of the six-site haplotypes are, on average, higher in African than European populations. This pervasive genetic diversity makes it feasible to compare LD for the different intervals and populations. The 32 populations studied here are sufficient in number with diverse enough geographic origins to allow many general conclusions about the RET locus.

The haplotype frequency estimates in Table 1 that are ≥10% should be accurate estimates of the common haplotypes in these populations and reflect the global variation in haplotype frequencies. Haplotypes with true frequencies of 1/2N might not have been observed in our samples. Also, haplotypes estimated to be absent in a population might actually have been present in the sample but not unambiguously so. Conversely, haplotypes with estimated frequencies on the order of 1/N or smaller may not actually be present.19 In all, 19 of these populations had more than 40% of the individuals with unambiguous marker phenotypes (homozygous for four or five sites) for the five-SNP haplotypes. A total of 10 populations had 31–39% of the individuals with unambiguous marker phenotypes. The smallest percentages were in the Finns with only 19% unambiguous marker phenotypes. Thus, the haplotype frequency estimates are usually based on considerable phase-known data.
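The criterion for an unambiguous marker phenotype (homozygous at four or five of the five SNPs) can be checked mechanically. The sketch below is illustrative only; the function names and the representation of a genotype as a list of allele pairs are assumptions, not the format used in PhenoDB2.

```python
def compatible_phase_count(genotype):
    """Number of distinct haplotype pairs consistent with an unphased genotype.

    genotype is a sequence of (allele, allele) pairs, one per site.
    With k heterozygous sites there are 2**(k - 1) possible pairs (1 if k <= 1).
    """
    n_het = sum(1 for a, b in genotype if a != b)
    return 1 if n_het <= 1 else 2 ** (n_het - 1)


def is_phase_unambiguous(genotype):
    """True when the individual is heterozygous at no more than one site."""
    return compatible_phase_count(genotype) == 1


def fraction_unambiguous(genotypes):
    """Fraction of individuals in a sample whose haplotypes are phase-known."""
    return sum(is_phase_unambiguous(g) for g in genotypes) / len(genotypes)
```

An individual heterozygous at all five SNPs is compatible with 16 different haplotype pairs, which is why the proportion of phase-known individuals matters for the reliability of the EM estimates.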

The haplotype frequency data (Table 1) and the regional averages (Figure 3) show that the haplotype frequencies do reflect the geographical clustering of our samples. Therefore, the RET polymorphisms will be useful markers for studies of population relationships, especially on a global level. As can be seen in Figure 2, individual sites can be selected as more informative for comparisons among populations from specific regions. For example, the intron 19 TaqI site is not informative among sub-Saharan Africans, but the exon 13 TaqI site and the exon 15 RsaI site should be very informative among such populations.

Linkage disequilibrium

Although only six sites across 41.6 kb are involved, the RET data illustrate many of the complexities of studying LD in human populations. The classic approach to LD is to use a coefficient such as D or D′35 or another measure17,43 to quantify the pairwise associations of alleles on chromosomes in a population. Recently, researchers have begun to consider defining the segments of DNA within which only a few haplotypes account for the majority of the chromosomes in a population.23,42 The value of this ‘hap-map’ approach is that it may be possible to identify the very small subset of the SNPs in a block that will serve to identify and discriminate among these common haplotypes. Although a short segment, RET is interesting to consider from this second perspective. In all non-African populations, there is highly significant overall LD for the five-SNP haplotypes. However, in European and Southwest Asian populations, this does not translate into a subset of these markers being sufficient to identify all common haplotypes. The most common haplotype is never >30% and, except for the Adygei and Russians, who require more, six different haplotypes are required to reach a cumulative frequency of 90%. All five SNPs are required to define these six haplotypes. Some other regions of the world require fewer haplotypes to account for the majority of chromosomes in a population because of reduced heterozygosity. This reduced heterozygosity can be explained as ascertainment bias for high heterozygosity of each site in Europeans. However, even where only three haplotypes account for >80% (eg Native Americans), three of the four SNPs are still required to discriminate among them.
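The count of haplotypes needed to reach a given cumulative frequency can be obtained by sorting the estimated frequencies and accumulating them. The following sketch shows the calculation; the function name is hypothetical, and any frequencies passed to it in an example would be illustrative rather than values from Table 1.

```python
def haplotypes_needed(freqs, threshold=0.90):
    """Smallest number of haplotypes whose summed frequency reaches the threshold.

    freqs is an iterable of haplotype frequencies for one population.
    Returns None if the frequencies never reach the threshold.
    """
    cumulative = 0.0
    for count, freq in enumerate(sorted(freqs, reverse=True), start=1):
        cumulative += freq
        # Small tolerance so that rounding error does not hide an exact hit.
        if cumulative >= threshold - 1e-12:
            return count
    return None
```

Applied population by population to the five-SNP estimates, a result of six at the 90% threshold would correspond to the European and Southwest Asian pattern described above, while a result of three at an 80% threshold would correspond to the Native American pattern.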

The test of LD of the STRP against the background haplotypes provides evidence on the mutation rate at the STRP and on recent human evolution. The absence of significant LD in African populations but the presence of significant LD in most non-African populations argues that the African populations have existed for a longer time than the non-African populations. The finding of several non-African populations with nonsignificant LD indicates that the mutation rate at the STRP is at least moderate relative to the time since founding of the non-African populations. This is also indicated by the occurrence of more than one STRP allele on evolutionarily derived haplotypes.

In contrast to an expectation of a negative correlation between pairwise LD and interval length (higher LD for shorter intervals, lower LD for longer intervals), we find that the two longer of the four independent intervals have generally higher LD than the two shorter intervals (Table 2). In the African populations, all intervals generally have nonsignificant (or noncalculable) LD values, but the longer internal intervals, intron 1 G/C SNP to exon 2 HaeIII and exon 2 HaeIII to exon 13 TaqI (17.8 kb), have markedly increased LD for non-African populations. This pattern suggests that the founder effect associated with the expansion of modern humans out of Africa established this pattern of relatively higher LD for the longer regions and lower LD for the shorter regions. Subsequent random genetic drift of different magnitudes for different populations and another founder effect associated with migration into the Americas would then modify this general pattern. The situation is more complex, however, because the pairwise LD spanning the 9.1 kb between exon 13 TaqI and intron 19 TaqI is generally much higher in all non-African populations than the LD across either of the two internal intervals exon 13 TaqI to exon 15 RsaI and exon 15 RsaI to intron 19 TaqI. In fact, for many populations this is the largest Δ²-value. The correlation coefficients in Table 3 show that these two smaller segments have a very similar pattern of LD among populations (r=0.94), while neither is as highly correlated with the LD for the region encompassing both (r=0.47 and 0.53). Since the same few haplotypes are involved in all cases, the unusual pattern does not relate to recombination but to frequency differences among the haplotypes. Such complex patterns of LD emphasize that LD is a statistical phenomenon, not an inherent property of a segment of DNA, and that random genetic drift, more than ‘hot spots’ of recombination, may be the major factor determining patterns of LD.

In a study of Hirschsprung disease in the genetically isolated Old Order Mennonite community, Carrasquillo et al44 found a ‘block’ of LD at the 5′ end of the gene that was strong in chromosomes transmitted to Hirschsprung patients and less pronounced in the untransmitted chromosomes from those families. A more diffuse ‘block’ was also present in the 3′ part of the gene in both sets of chromosomes. The untransmitted chromosomes are likely to represent random chromosomes from the Old Order Mennonites. The strong LD patterns they found at the 5′ and the 3′ ends of RET may well be due to a founder effect in their sample of Mennonites and differ from what we find with our more limited number of markers in our samples of normal variation in most populations including the European populations. Three of the markers, intron 5 (CA)n, exon 13 TaqI, and exon 15 RsaI, were the same as we have typed. Our intron 1 and exon 2 markers fall within the 5′ block of Carrasquillo et al and our exon 13, exon 15, and intron 19 markers fall within their 3′ block. In just over half of the populations we studied, the LD between the intron 1 and exon 2 markers (‘intrablock’) is less than the LD between the exon 2 and exon 13 markers (‘interblock’), two intervals of roughly equal length. As noted above, the LD values for the exon 13 to exon 15 and the exon 15 to intron 19 intervals, which are both within the ‘3′ block’, are generally much smaller even though the interval lengths are shorter.

Recently, a founder haplotype, possibly accounting for many cases of Hirschsprung disease in Spaniards, was described using several SNPs at RET.45 The exon 2 HaeIII, exon 13 TaqI, and exon 15 RsaI in the present paper were included in that study along with others. By extrapolation of the association of these markers with Hirschsprung disease, a possible susceptibility variant close to exon 1 of the gene was postulated. The distance of the extrapolation was 20 kb. It is generally impossible to predict association across such distances because, as shown here for these and other RET SNPs, the patterns of LD are complex and do not follow simple regressions with molecular distance. However, the new intron 1 G/C SNP will now allow that hypothesis to be more robustly tested. The LD among markers in normal individuals bears no necessary relation to the association expected for any one marker and a disease susceptibility allele. Therefore, our study does not predict what might be found for association of this intron 1 G/C SNP with Hirschsprung disease.

Haplotype variation at RET contrasts with the pattern seen at DRD2,12 PAH,13 and COMT37 for many of the same population samples. At those other loci, there were more haplotypes and higher heterozygosities in the African populations. However, the data are consistent among all these loci in showing less LD in the African samples than elsewhere. Thus, RET is another locus that shows low levels of LD in multiple African populations, strengthening the conclusion that low levels of LD are characteristic of African populations in general. While the SNP-defined haplotype diversity at RET does not support an Out-of-Africa model of recent human evolution, the STRP diversity and its disequilibrium with the background, SNP-defined haplotypes (Figure 4) do support an Out-of-Africa model with a significant founder effect in the ancestry of the non-African populations.

Electronic databases cited

ALFRED (ALlele FREquency Database): http://alfred.med.yale.edu/alfred/

Kidd Lab Web Site: http://info.med.yale.edu/genetics/kkidd

GenBank: http://www.ncbi.nlm.nih.gov/Genbank/index.html