Introduction

Considerable attention is currently being focused on the genotyping of single-nucleotide polymorphisms (SNPs), and on their utility for linkage disequilibrium (LD) mapping studies that hope to identify susceptibility genes for complex diseases. In the context of LD mapping, it would be important to identify regions of conserved LD or, alternatively, of high recombination and gene conversion throughout the human genome. One goal here is to detect the common haplotypes such that the number of SNPs that need to be genotyped for a gene mapping study can be reduced.1,2,3,4,5 However, not only do recombination rates vary greatly across the genome6 but differences in population history and structure cause the extent of LD and mutational load at different genes to be population specific.1,2,4,7 Furthermore, in the case of complex diseases – where many loci may confer susceptibility – ethnically divergent populations may exhibit the same phenotype but this phenotype may not necessarily be caused by the same set of susceptibility loci. This assumption has been largely hypothetical; the recently described association of Crohn's disease with CARD15 provides an ideal test case.

Crohn's disease (CD; MIM 266600) and ulcerative colitis (UC; MIM 191390) represent the two major forms of inflammatory bowel disease (IBD; MIM 601458). These diseases are characterized by chronic relapsing inflammation of the gastrointestinal tract.8,9 The prevalence of IBD in some western countries is as high as 0.5%.10,11 Consistent evidence for familial clustering,12 an increased concordance of the IBD phenotype in monozygotic twins13,14 and consistently positive results from genetic linkage studies have repeatedly confirmed the involvement of complex genetic factors in the etiology of these conditions.

Genome-wide linkage analyses have detected several susceptibility regions on different chromosomes, including the linkage region on chromosome 16, IBD1.15 Linkage to IBD1 has been replicated in several independent Caucasian populations.16,17,18,19,20,21,22,23,24,25 Recently, mutations in the leucine rich region (LRR) of the CARD15 (NOD2) (MIM 605956) gene on chromosome 16q12 have been discovered. These mutations are strongly associated with CD in populations of European descent.26,27,28

CARD15 is a member of the APAF-1/CED-4 family of genes. Genes from this family show some structural similarities to plant NOD resistance genes, and have been implicated in pro-inflammatory cytokine induction and apoptosis pathways involving TNFα and NFκB, with CARD15 exhibiting monocyte-specific expression.28,29 Preliminary functional evidence has suggested that the LRR regions of CARD15 may play a role in the response to bacterial lipopolysaccharides by altering the activation of NFκB.28,30 However, the exact functional and molecular role of CARD15 in the immune response remains unclear.31

Three mutations in the LRR have been implicated in CD. Two of these, C14772T (R702W) in exon 4 and G25386C (G908R) in exon 8 (labelled SNP8 and SNP12, by Hugot et al;27 from here on referred to as o8 and o12) cause amino acid substitutions. The insertion of a C in exon 11, 32629insC (1007insC) (labelled SNP13, here=o13) causes truncation of the protein. It has been suggested that the three mutations could alter activation of NFκB through inefficient CARD15 dimerization or impaired recognition of microbial components.27,28

Hugot et al.27 indicated that the CARD15 mutations never occur on the same chromosome and identified three haplotypes that show markedly distorted transmission in affected nuclear families. These three haplotypes are identical except that each carries one of the three mutations. Although the three mutations are present on other haplotypes, these are too rare to allow detection of disease association.

The occurrence of three independent causative mutations on the same haplotype background does not appear parsimonious and suggests that a further, truly causative or more strongly predisposing mutation may exist that is in some LD with the three LRR mutations. Presence of the haplotype carrying this unknown mutation would thus be a prerequisite for any other variation to be classified as disease-associated.

It was decided, therefore, to perform a detailed study of the genetic variation in and around the entire CARD15 gene. The aim was twofold. First, to determine whether CARD15 confers susceptibility to CD in an ethnically and historically distinct Asian population – susceptibility that has previously been demonstrated for several populations of predominantly European descent. To this end, data from a South Korean cohort are presented and patterns of LD and underlying haplotype structure are compared between Europeans and Asians. Second, the possibility of a common mutation in LD with the previously described, putatively causative LRR mutations is explored.

Subjects and methods

Subjects

British families (75 CD-affected pedigrees containing 162 CD-affected individuals and 72 affected sib-pairs (ASPs)) and German families (144 CD-pedigrees; 265 CD-individuals, 124 ASPs) were recruited by an international group of IBD investigators at the Charité University Hospital (Berlin, Germany), the Department of General Internal Medicine at Christian-Albrechts-University (Kiel, Germany), St Mark's, Guy's and King's College Hospitals (London, UK) and other German centres. These cohorts have been described in previous studies.16,26 Additionally, German trios (307 patients with sporadic CD and two unaffected parents) were recruited. Normal controls (370 individuals) of German origin were obtained through the Department of Transfusion Medicine at the Kiel University Hospital. South Korean sporadic CD cases (126 individuals) and unrelated healthy controls (116 individuals) were recruited at the Department of Internal Medicine, University of Ulsan College of Medicine, Asan Medical Center, Seoul and the Department of Internal Medicine, Yonsei University College of Medicine, Seoul.

The diagnosis of CD was confirmed by clinical, radiological and endoscopic (type of lesions, distribution) analyses.32,33 Additionally, histological findings had to be confirmatory of, or compatible with, this diagnosis. The diagnosis in the Korean patients was additionally verified during an observational visit by the Kiel group. EDTA blood was obtained from all study participants. Informed, written consent was obtained from all study participants and recruitment protocols were approved by ethics committees at participating centres before commencing the cohort assemblies.

Methods

The CARD15 gene was screened for SNPs by genomic and cDNA resequencing in 47 IBD affected German patients (24 CD, 23 UC). Searching for new SNPs only in affected individuals maximizes the likelihood of finding disease associated mutations (however, we note that putative protective polymorphisms may be overlooked, on occasion, using this strategy). Mutation detection focused on the exons, exon–intron boundaries and on 1 kb regions flanking the gene at approximately 50, 100, 150 and 200 kb up and downstream of the ATG. Primers for PCR amplification were designed on the basis of GenBank sequences NT_027173 and XM_012541. In total, 23 polymorphisms are reported. These include those previously reported as associated with CD susceptibility (accession numbers to dbSNP: ss2978533, ss2978538, ss2978540-43, ss2992220-24, ss2992238-39, ss2992242, and ss4383587-95).

The SNPs were genotyped by allelic discrimination using TaqMan technology on an ABI 7700 Sequence Detector (Applera, Foster City, CA, USA), with the primers and probes outlined in Table 1.

Table 1 TaqMan assay primer and probe sets

The same regions were screened for mutations in 47 Korean CD patients by genomic resequencing, with genotyping and confirmation of the sequencing results using the TaqMan assays established in the European samples.

The data were checked and managed by means of an integrated database system.34

Statistical analyses

Each marker was tested for Hardy–Weinberg equilibrium in the control populations using a χ2 test. Genetic analyses were then performed at several levels. To confirm the association with CD, each marker was first subjected to single-locus tests for linkage as follows. The UK families, German families and German trios were examined for distorted transmission using the TRANSMIT program35,36 with significance levels verified using 1000 bootstrap replicates for each test. A case–control analysis was performed against unrelated controls on the Korean dataset, and also on the European data after randomly extracting a single affected offspring from each family or trio. In all analyses, the UK and German data were pooled into a single European cohort. The validity of such pooling was verified by comparing the allele frequencies at each marker in the random cases using χ2 statistics or Fisher's exact test, as appropriate (data not shown). Genotype-based odds ratios (OR) were calculated and association tested similarly.
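The Hardy–Weinberg check described above can be illustrated with a minimal sketch. It assumes a biallelic marker, genotype counts from unrelated controls, and a goodness-of-fit test with one degree of freedom. The function name `hwe_chi2` and the example counts are hypothetical and are not taken from the study.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium.

    n_aa, n_ab, n_bb are the observed genotype counts at a biallelic marker.
    Returns (chi-square statistic, p-value), assuming 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele a
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical control genotype counts, for illustration only
chi2, p_value = hwe_chi2(90, 240, 170)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

A marker with a small p-value in controls would usually be checked for genotyping error before being carried into the association tests.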

Pair-wise LD between each marker pair was calculated as $D_{ij} = p_{ij} - p_i p_j$, where $p_{ij}$ is the frequency of the haplotype carrying allele i at the first locus and allele j at the second locus, and $p_i$ and $p_j$ are the frequencies of alleles i and j. This was transformed into the two standardized LD coefficients, $r^2$ and $D'$. Here, r is the allelic correlation coefficient given by $r = D_{ij} / (p_i p_j [1-p_i][1-p_j])^{1/2}$.37 $D'$ was computed as $D_{ij}/D_{ij,\max}$, where $D_{ij,\max}$ is the maximum LD possible for two markers with allele frequencies $p_i$ and $p_j$, calculated as $\min(p_i p_j, [1-p_i][1-p_j])$ if $D_{ij} < 0$ or $\min([1-p_i]p_j, p_i[1-p_j])$ if $D_{ij} > 0$.38
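The three LD quantities defined above can be computed directly from one haplotype frequency and two allele frequencies. The following is a minimal sketch under that assumption. The function name `pairwise_ld` and the example frequencies are hypothetical.

```python
def pairwise_ld(p_ij, p_i, p_j):
    """Return (D, r2, D') for two biallelic markers.

    p_ij : frequency of the haplotype carrying allele i and allele j
    p_i  : frequency of allele i at the first locus
    p_j  : frequency of allele j at the second locus
    """
    d = p_ij - p_i * p_j
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    r2 = d * d / denom if denom > 0 else float("nan")
    # Maximum attainable |D| given the allele frequencies and the sign of D
    if d < 0:
        d_max = min(p_i * p_j, (1 - p_i) * (1 - p_j))
    else:
        d_max = min((1 - p_i) * p_j, p_i * (1 - p_j))
    # D' keeps the sign of D here; take abs() to obtain absolute values
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, r2, d_prime

# Hypothetical frequencies, for illustration only
d, r2, d_prime = pairwise_ld(p_ij=0.40, p_i=0.50, p_j=0.50)
print(round(d, 3), round(r2, 3), round(d_prime, 3))  # 0.15 0.36 0.6
```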

Haplotype frequencies were estimated from phase-unknown and phase-known genotypic data using the Expectation Maximisation (EM) algorithm.39 All individuals with incomplete genotypes were removed from the analysis. The haplotype frequency estimates from the EM algorithm were cross-checked against the European family and trio data, using the program GENEHUNTER 2.

The EM algorithm uses maximum likelihood approaches to estimate haplotype frequencies from partially phase-unknown genotypic data. Large numbers of markers generate prohibitively intensive searches and therefore subsets of informative markers had to be selected. This choice was based both on the comparison of the two populations and the patterns of LD observed between the markers (see Results).
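To make the EM idea concrete, the sketch below treats the simplest case of two biallelic markers, in which only double heterozygotes have ambiguous phase. It is an illustrative simplification and not the multi-marker implementation used in the study. The function name `em_two_locus`, the input layout and the example counts are hypothetical.

```python
def em_two_locus(counts, n_iter=1000, tol=1e-12):
    """EM estimate of two-locus haplotype frequencies.

    counts[a][b] is the number of individuals carrying a copies of allele 1
    at locus A and b copies of allele 1 at locus B (a, b in 0..2).
    Returns a dict mapping (allele_A, allele_B) to the estimated frequency.
    """
    n_dh = counts[1][1]                      # double heterozygotes
    n_chrom = 2 * sum(sum(row) for row in counts)
    base = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    # Count haplotypes whose phase is unambiguous
    for a in range(3):
        for b in range(3):
            n = counts[a][b]
            if a == 1 and b == 1:
                continue
            if a != 1:                       # homozygous at locus A
                allele_a = a // 2
                base[(allele_a, 1)] += n * b
                base[(allele_a, 0)] += n * (2 - b)
            else:                            # heterozygous at A, homozygous at B
                allele_b = b // 2
                base[(1, allele_b)] += n
                base[(0, allele_b)] += n
    freqs = {h: 0.25 for h in base}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is 11/00 rather than 10/01
        cis = freqs[(1, 1)] * freqs[(0, 0)]
        trans = freqs[(1, 0)] * freqs[(0, 1)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: update frequencies from the expected haplotype counts
        new = {
            (1, 1): (base[(1, 1)] + n_dh * w) / n_chrom,
            (0, 0): (base[(0, 0)] + n_dh * w) / n_chrom,
            (1, 0): (base[(1, 0)] + n_dh * (1 - w)) / n_chrom,
            (0, 1): (base[(0, 1)] + n_dh * (1 - w)) / n_chrom,
        }
        delta = max(abs(new[h] - freqs[h]) for h in freqs)
        freqs = new
        if delta < tol:
            break
    return freqs

# Hypothetical genotype counts, for illustration only
example = [[40, 8, 1], [10, 30, 4], [1, 3, 3]]
print(em_two_locus(example))
```

With more markers, the number of phase configurations per individual grows quickly, which is why the search becomes intensive and marker subsets have to be chosen.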

The null-hypothesis that there is no difference in overall estimated haplotype frequencies between cases and controls could be tested using a likelihood ratio test. However, estimated haplotype frequencies cannot be treated as observed data and therefore no valid statistical test exists for evaluating the possible contribution of individual haplotypes to any deviation from this null-hypothesis. Consequently, a robust permutation test for haplotype association was performed. Pseudo-χ2 statistics were calculated both for the overall haplotype tables (global test) and the individual haplotypes in the tables. The significance of these statistics was evaluated by shuffling-together and repartitioning the case and control individuals, re-estimating the haplotype frequencies and then re-calculating the pseudo-χ2 statistics 10 000 times. Evaluating the contribution of individual haplotypes is only strictly relevant if the null-hypothesis of no difference in overall estimated haplotype frequencies between cases and controls (the global test) is rejected; this evaluation does not represent a new statistical hypothesis and therefore correction for multiple-testing is not necessary.
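The shuffling procedure described above can be sketched as follows. To keep the example short, it assumes that each individual's two haplotypes are already known, so the re-estimation of haplotype frequencies by EM in every replicate is replaced here by simple counting. That substitution is a simplification of the procedure in the text. The names `pseudo_chi2` and `permutation_p`, the add-one p-value estimate and the example data are all assumptions made for illustration.

```python
import random
from collections import Counter

def pseudo_chi2(cases, controls):
    """Chi-square-type statistic on a haplotype-by-status table.

    cases and controls are lists of individuals; each individual is a
    tuple of two haplotype labels (phase assumed known in this sketch).
    """
    case_counts = Counter(h for ind in cases for h in ind)
    ctrl_counts = Counter(h for ind in controls for h in ind)
    n_case = sum(case_counts.values())
    n_ctrl = sum(ctrl_counts.values())
    total = n_case + n_ctrl
    stat = 0.0
    for hap in set(case_counts) | set(ctrl_counts):
        row = case_counts[hap] + ctrl_counts[hap]
        for obs, n_col in ((case_counts[hap], n_case), (ctrl_counts[hap], n_ctrl)):
            exp = row * n_col / total
            stat += (obs - exp) ** 2 / exp
    return stat

def permutation_p(cases, controls, n_perm=10000, seed=0):
    """Global permutation test: shuffle case/control labels and recompute."""
    rng = random.Random(seed)
    observed = pseudo_chi2(cases, controls)
    pooled = list(cases) + list(controls)
    n_case = len(cases)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if pseudo_chi2(pooled[:n_case], pooled[n_case:]) >= observed:
            hits += 1
    # Add-one estimate so that the reported p-value is never exactly zero
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical phased data, for illustration only
cases = [("H3", "H1")] * 15 + [("H1", "H2")] * 25
controls = [("H1", "H2")] * 36 + [("H3", "H2")] * 4
stat, p = permutation_p(cases, controls, n_perm=1000)
print(f"pseudo-chi2 = {stat:.2f}, permutation p = {p:.4f}")
```

Individuals, rather than chromosomes, are shuffled so that each person's pair of haplotypes stays together, matching the repartitioning of case and control individuals described in the text.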

The NETWORK 3.0 program was used to infer the likely genealogical history between the most frequent haplotypes using the Median–Joining (MJ) algorithm.40 In an MJ network, circles represent distinct haplotypes and are scaled to reflect the frequency of these haplotypes. The branches connect the haplotypes and indicate the mutational steps between the haplotypes. The MJ network was generated for the set of European case and control haplotypes.

Results

CARD15 diversity at the nucleotide level

The CARD15 gene consists of 12 exons spanning 35.9 kb of genomic sequence and encoding an mRNA transcript of 4486 bp. All exons and exon–intron boundaries of the CARD15 gene, plus 1 kb regions flanking either side of the gene at 50, 100, 150 and 200 kb intervals, were screened in 47 patients (94 chromosomes) each, from both the European and Korean CD samples. In the European patients, a total of 23 SNPs, spanning 290 kb, were confirmed and genotyped, giving a mean density of one SNP per 12.6 kb (one SNP per 2.3 kb in the CARD15 coding region). Only 10 of these SNPs were present in the Korean samples, and no additional variants were identified in this population. The absence of variants on sequencing was further confirmed using the TaqMan assays, previously established in the European population. Most notably absent were SNPs R702W, G908R, and 1007insC, which correspond to the disease-associated SNPs (o8, o12, o13) as outlined by Hugot et al.27 Indeed, only two of the SNPs described by Hugot et al.27 were present in the Korean sample – SNPs 10 and 14 (o7 and o9). The SNPs with their positions, nomenclature, coding status and frequencies are summarized in Table 2.

Table 2 SNPs and association statistics in Europeans and Koreans

Table 2 also outlines the results of the single-locus tests of association for each SNP. In the European samples, the TDT and case–control results are complementary, with non-significant results obtained for the most distal flanking markers (SNPs 1, 22, 23) and also for SNP 19, but consistently significant association for all other markers within, or close to, the gene. In the Koreans, none of the markers exhibits any significant association to CD as assessed by a case–control design. Only SNP 2 and SNP 22 suggest marginal significance, which dissipates on correction for multiple testing and is not supported by two-marker haplotype analysis (data not shown).

LD in CARD15

LD was studied between CD associated SNPs 1007insC, G908R and R702W (o13, o12, o8 respectively) and the remaining SNPs. These results (Figure 1) clearly highlight the problems inherent to LD metrics r2 and D'. According to r2, SNPs 1007insC, G908R and R702W are not in linkage disequilibrium with each other, nor with any other markers. The low r2 values most likely result from the low frequency of the rare allele of each SNP (see Table 2); r2 is known to be highly sensitive to skewed allele-frequencies.7,41 The values of D' are, however, also problematic. For many combinations, D' values of 1.0 result and these again are mainly due to the low frequencies of SNPs 1007insC, G908R and R702W. By definition, the presence of only three of four possible haplotypes results in D' being equal to unity. Measures of marker–marker LD, as employed here, are therefore not helpful for deciding if rare disease-associated variants such as SNPs 1007insC, G908R and R702W are co-segregating with other (perhaps causative) mutations.
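A short worked example, using the definitions from the Statistical analyses section, illustrates both effects. The frequencies are hypothetical: a rare allele i with frequency 0.05 that occurs only on chromosomes carrying a common allele j with frequency 0.50, so that one of the four possible haplotypes is absent.

```latex
\begin{align*}
p_i &= 0.05, \quad p_j = 0.50, \quad p_{ij} = 0.05 \\
D_{ij} &= p_{ij} - p_i p_j = 0.05 - 0.025 = 0.025 \\
D_{ij,\max} &= \min\bigl([1-p_i]\,p_j,\; p_i\,[1-p_j]\bigr) = \min(0.475,\ 0.025) = 0.025 \\
D' &= D_{ij}/D_{ij,\max} = 1 \\
r^2 &= \frac{D_{ij}^2}{p_i p_j (1-p_i)(1-p_j)} = \frac{0.000625}{0.011875} \approx 0.053
\end{align*}
```

Under these assumed frequencies D' is at its maximum while r2 falls below the 0.1 criterion for meaningful LD, even though the rare allele is found exclusively on one background.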

Figure 1

Pair-wise LD between 1007insC (A), G908R (B), and R702W (C) and other CARD15 SNPs. Both r2 (dotted line, triangle) and D' (solid line, diamond) are displayed.

LD was also calculated between other pairs of markers. Figure 2 shows how D' and r2 values decrease with distance between markers in the European samples. A value of r2>0.1 has been suggested as a criterion for meaningful LD7,42 whereas a cut-off of r2>0.5 is perhaps more useful for visualizing LD-groups.7 Examining Figure 2 from this perspective suggests that useful LD in CARD15 declines sharply with distance and extends maximally to between 50 and 100 kb. Overall, LD in the European sample and the Korean sample appeared similar (mean pair-wise r2 between markers: European=0.18±0.03, Korean=0.24± 0.09; mean pair-wise D' between markers: European= 0.68±0.05, Korean=0.65±0.11).

Figure 2

Pair-wise LD and distance (kilobases) between 23 SNP markers in Europeans. LD declines sharply with increasing distance. (A) Absolute values of D'. (B) r2.

Figure 3 illustrates pair-wise LD between all marker pairs for the European (lower diagonal) and Korean (upper diagonal) control populations. D' values are explicitly presented, whilst r2 values >0.5 are indicated in black and r2 values between 0.25 and 0.5 are highlighted in grey. Following the methodology of Nakajima et al,7 r2 values >0.5 were used to identify LD-groups (Figure 3, bottom). With the exception of SNP2, all SNPs present in the Korean controls fell into one LD-group. In the European sample, two LD-groups were apparent – one coinciding almost exactly with the Korean markers and one containing the other markers, except SNPs R702W, G908R, 1007insC, 21, 1, 2, 22, 23 (the latter four, which are the most distal, showed little overall linkage disequilibrium with the other SNPs).

Figure 3

Pair-wise LD of SNPs in CARD15 as measured by D' (numbers) and r2 (shading). Top diagonal: Korean. Bottom diagonal: European. LD-groups defined by markers with r2 values >0.5 are illustrated below.

Haplotype analysis

Haplotypes were constructed using an EM algorithm with two sets of SNPs; those shared between Europeans and Koreans (set 1), and those exclusively found in Europeans (set 2). SNPs 1, 2, 22 and 23 were eliminated from the haplotype analysis on the grounds of too little LD. Only haplotypes with estimated frequencies greater than 1% in the combined cases and controls were considered. Haplotype frequencies for SNP set 1 were analysed for a difference between cases and controls in both the Korean and European samples. The estimated haplotypes and corresponding statistics are given in Table 3. For both populations, the global permutation test indicated no association with CD. Furthermore, and in agreement with this, no individual haplotypes indicated significant association with CD. Given the lack of association in either population for haplotypes inferred from the shared LD-group (SNP set 1), plus the lack of single point association for these SNPs in the Korean sample, it seems unlikely that any of these SNPs are directly involved in the aetiology of CD.

Table 3 Haplotypes from the shared SNP set 1

Haplotype frequencies were also estimated for the European case and control samples using only the markers unique to Europeans (set 2; Table 4). The global permutation test was highly significant. Four haplotypes, designated H1, H2, H5 and H7, were negatively associated with CD (combined OR=0.295; Wald 95% CI=0.228–0.382; ORs are haplotype ORs derived from the estimated haplotype frequencies). H2, the second most common haplotype, represents the Korean haplotype and is therefore probably ancestral. Three haplotypes, H3, H4, and H8, were positively associated with CD (H3: OR=4.857, Wald 95% CI=2.923–8.075; H4: OR=2.975, Wald 95% CI=1.831–4.834; H8: OR=27.493, Wald 95% CI=3.409–221.750). H3 carried the mutant form of SNP15 (o13, 1007insC), H4 carried SNP11 (o8, R702W) and H8 carried SNP12 (o12, G908R). Taken together these three haplotypes account for 30.2% of the chromosomes in the CD sample and only 7.7% of the chromosomes in the control sample. The overall OR was 5.186 (Wald 95% CI=3.635–7.400).
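The overall OR can be recovered from the two combined haplotype frequencies quoted above. The sketch below shows that arithmetic, together with a generic Wald interval for a 2×2 table of chromosome counts. The chromosome counts needed to reproduce the reported intervals are not given in the text, so the counts passed to `wald_ci` are hypothetical, as are both function names.

```python
import math

def odds_ratio_from_freqs(p_case, p_control):
    """Haplotype odds ratio from carrier frequencies in cases and controls."""
    return (p_case / (1 - p_case)) / (p_control / (1 - p_control))

def wald_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald 95% confidence interval from chromosome counts.

    a: case chromosomes carrying the haplotype, b: case chromosomes without it
    c: control chromosomes carrying it,         d: control chromosomes without it
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = odds_ratio * math.exp(-z * se_log_or)
    upper = odds_ratio * math.exp(z * se_log_or)
    return odds_ratio, lower, upper

# Frequencies quoted in the text: 30.2% of case and 7.7% of control chromosomes
print(round(odds_ratio_from_freqs(0.302, 0.077), 3))  # 5.186

# Hypothetical chromosome counts, for illustration only
print(wald_ci(a=60, b=140, c=20, d=180))
```

The Wald interval is symmetric on the log-odds scale, which is why intervals for rare haplotypes such as H8 are very wide.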

Table 4 Final European haplotypes (SNP set 2)

Figure 4 shows the MJ network for the 12 most frequent haplotypes occurring in the European (set 2) combined case and control sample (92% of the chromosomes). The topology was identical if only the controls (or cases) were analysed. The squared box represents the ancestral haplotype shared between the Korean and European population. The mutational steps (SNPs) between each node are marked and the nodes are scaled relative to the count of the haplotype. SNP3 was rejected from the analysis as uninformative due to homoplasy (recombination). This resulted in the identity of haplotypes H5 and H7 (indicated as H5 in Figure 4). Overall, the MJ network placed the haplotypes into two groups. The putatively ancestral haplotype H2, shared between the Korean and European populations, along with the common H1, dominated one half of the network. Most other haplotypes, including all the positively and negatively associated haplotypes, fell into a complex grouping distinct from these common haplotypes. If the mutational steps across the network are examined, it is apparent that the only mutations unique to the positively disease-associated haplotypes H3, H4 and H8 are SNPs 1007insC, R702W and G908R respectively. All other mutations are shared by other haplotypes not positively associated with CD (see Table 4). Therefore it seems unlikely that any of the SNPs examined, other than SNPs R702W, G908R and 1007insC, is strongly implicated in CD susceptibility.

Figure 4

MJ network of the 12 most common haplotypes in the European case and control samples. Arrows link the unique haplotypes and indicate the mutational relationships between them. H5 includes H5 and H7. The size of a node is approximately proportional to the frequency of that haplotype in the total sample (the most frequent haplotypes have been downscaled for clarity). Mutational steps are indicated on the branches (variants numbered as in Table 2). The squared box indicates the shared ‘ancestral’ European and Korean haplotype.

Discussion

Although the exact nature of the molecular role of CARD15 (NOD2) in mediating the immune response remains unclear,31 the fact that it is involved in inflammatory disorders seems unequivocal. Not only have mutations in the LRR of the CARD15 gene been repeatedly reported in association with CD26,27,28 but mutations in the nuclear binding domain (NBD) have now been implicated in Blau syndrome (BS; MIM 186580),43 another granulomatous disorder with histological similarities to sarcoidosis (MIM 181000). However, the NBD mutations appear to be restricted to familial BS and have not so far been observed in the general population.43

The results presented here illustrate a number of important points pertinent to the mapping and characterization of disease genes, not only for CD but for complex disorders in general. The single locus tests of association for the European samples (Table 2) highlight that, given a sufficiently large sample, consistent association between CD and variation in the CARD15 gene can be detected throughout the length of the gene and in the surrounding area (in the case of CARD15 a region of around 100 kb). This result implies that in a well-designed experiment, one would be unlikely to overlook the association with CD. On the basis of association alone, however, it is impossible to ascertain which, if any, of the SNPs may be causative.

Measuring pair-wise LD between markers may also not be particularly informative. The values of D' and r2 between SNPs R702W, G908R and 1007insC and all other SNPs yielded very little information about actual disease association. For example, if association had initially been detected to SNP 3 during an SNP based genome scan, subsequent assessment of pair-wise LD using r2 values would have failed to find linkage disequilibrium to SNPs R702W, G908R and 1007insC. LD, therefore, does not provide a short-cut; full genotyping of patients is required at each marker. As previously mentioned, the values of r2 may have been so low because the predisposing mutations are rare. For similar reasons, D' values often attain their maximum value of 1.0. This happens whenever one or more of the four possible haplotypes between two markers is absent. One reason that a marker allele may be rare is that it has arisen recently, so that there has been no time for the fourth haplotype to appear. Hence D' is not a sensitive measure of LD to recent and/or rare mutations.7

Allowing for the limitations given above, pair-wise measures of LD were quite high between all SNPs. Figure 3 illustrates how substantial LD in European CARD15 genes appears to extend between about 50 and 100 kb. This fits with the apparent average range of 60 kb ‘typical’ for genes in populations of northern European descent.1

Use of r2 revealed two LD-groups in the Europeans (excluding a number of apparently unlinked markers – notably the rare SNPs R702W, G908R, 1007insC and 19, and the distantly flanking markers). One LD-group reflected the set of SNPs and the LD-group observed in the Korean population, a result that makes intuitive sense. These markers were generally of moderate to high polymorphism and no doubt represent relatively ancient and ethnically shared variation. The ORs of less than 1.0 shown by most of these markers in the Europeans, together with the lack of single-point association with CD in the Korean population and the lack of association in both populations at the haplotype level, argue strongly against their involvement in CD. This result demonstrates the value of a population comparative approach to identifying the causative variations in susceptibility genes for complex disease.

The analysis of the European population using the set of markers not shared with the Koreans yielded few haplotypes. The results corroborated those of Hugot et al.27 in that the three putatively causative mutations did not occur together on a single haplotype but shared a common background haplotype. From these results alone, it is still not feasible to distinguish between causality of these mutations and of others on the same background haplotype. However, a genealogical network approach, placing the haplotypes into a network, allowed exclusion of all other markers except SNPs R702W, G908R and 1007insC as predisposing to CD. Consequently, from our set of 23 SNPs no evidence was found for a further causative variant that unites SNPs R702W, G908R and 1007insC within a common background haplotype. Therefore it is not possible to discount the argument that these three SNPs truly are the causative variants. However, it still appears plausible that such a variant may exist as yet undetected, perhaps in an upstream promoter element. Indeed, in an exhaustive re-sequencing of the CARD15 coding region of 457 CD patients, 159 UC patients and 103 unaffected unrelated individuals, Hugot et al27 found a large number of additional but extremely rare missense variants that may also be associated with CD. If a unique causative variant remains to be discovered, the search may be targeted at individuals carrying the common background haplotype for SNPs R702W, G908R and 1007insC.

The fact that SNPs R702W, G908R and 1007insC are associated with a common background haplotype has probably been fortunate, since the presence of several disease-predisposing alleles within a susceptibility locus, each in association with a different background haplotype, can seriously compromise the ability to locate the susceptibility locus by LD mapping.7,44 If the apparently causative SNPs R702W, G908R and 1007insC had not shared a common background haplotype then association might have been much harder to detect. The mutant alleles of all three mutations are rare (in the control sample 4.70, 0.60 and 5.00% respectively). If they had unconnected origins and resided on unrelated haplotypes then they might have obscured each other in the single locus association tests. Although haplotype analyses would have resolved this, the likelihood of not observing the association to CARD15 in the first place would have been much greater.

The lack of disease association of CARD15 in an Asian population that experiences CD of a phenotype and incidence equivalent to those in the European population highlights the importance of ethnic comparisons in identifying the susceptibility genes for complex disorders. The combination of examining ethnically shared variants and genealogical reconstruction of haplotypes can be a powerful tool in narrowing the search for causative mutations.

Electronic-database information

Accession numbers and URLs for data in this article: GenBank, http://www.ncbi.nlm.nih.gov/Genbank; Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/ (for CD (MIM 266600), for UC (MIM 191390), for IBD (MIM 601458), for BS (MIM 186580) and for CARD15 (MIM 605956)); dbSNP, http://www.ncbi.nlm.nih.gov/SNP/; NETWORK 3.0, http://www.fluxus-engineering.com