

Estimating risks of common complex diseases across genetic and environmental factors: the example of Crohn disease
  1. C M Lewis1,2,
  2. S C L Whitwell3,
  3. A Forbes4,
  4. J Sanderson5,
  5. C G Mathew1,
  6. T M Marteau3
  1. Department of Medical and Molecular Genetics, Division of Genetics and Molecular Medicine, King’s College London School of Medicine, Guy’s Campus, London, UK
  2. Social, Genetic and Developmental Psychiatry Research Centre, Institute of Psychiatry, King’s College London, London, UK
  3. Psychology and Genetics Research Unit, King’s College London School of Medicine, Guy’s Campus, London, UK
  4. Department of Gastroenterology, Institute for Digestive Diseases, University College London Hospitals Trust, London, UK
  5. Department of Gastroenterology, Guy’s and St Thomas’ NHS Foundation Trust, London, UK
  1. Professor Cathryn M Lewis, Department of Medical and Molecular Genetics, King’s College London School of Medicine, 8th Floor, Guy’s Tower, Guy’s Hospital, London SE1 9RT, UK; cathryn.lewis{at}


Background: Progress has been made in identifying mutations that confer susceptibility to complex diseases, with the prospect that these genetic risks might be used in determining individual disease risk.

Aim: To use Crohn disease (CD) as a model of a common complex disorder, and to develop methods to estimate disease risks using both genetic and environmental risk factors.

Methods: The calculations used three independent risk factors: CARD15 genotype (conferring a gene dosage effect on risk), smoking (twofold increased risk for smokers), and residual familial risk (estimating the effect of unidentified genes, after accounting for the contribution of CARD15). Risks were estimated for high-risk people who are siblings, parents and offspring of a patient with CD.

Results: The CD risk to the sibling of a patient with CD who smokes and carries two CARD15 mutations is approximately 35%, which represents a substantial increase on the population risk of 0.1%. In contrast, the risk to a non-smoking sibling of a patient with CD who carries no CARD15 mutations is 2%. Risks to parents and offspring were lower.

Conclusions: High absolute risks of CD can be obtained by incorporating information on smoking, family history and CARD15 mutations. Behaviour modification through smoking cessation may reduce CD risk in these people.

  • Crohn disease
  • genetics
  • CARD15
  • risk estimation


Crohn disease (CD) is a severe inflammatory disorder of the gastrointestinal tract. The disorder can affect any part of the gut, but primarily the ileum and colon are affected. Symptoms vary between patients and include abdominal pain, diarrhoea and weight loss. The inflammation is controlled by drugs (steroids, immune modulators) and surgery. The cause of CD is unknown, but is likely to be an inappropriate or exaggerated mucosal immune response to constituents of gut microflora. The prevalence of CD varies across geographic region and time, with incidence and prevalence rates increasing recently in the developed world. Prevalence rates are between 54 and 214 per 100 000 population,1 and the onset of CD generally occurs between 15 and 30 years of age.

The aetiology of CD is complex. Monozygotic twins show approximately 50–60% disease concordance, with much lower rates in dizygotic twins (∼10%), highlighting the role of both environmental and genetic components in the development of CD.2 3 The strongest and best replicated environmental risk factor is smoking, which increases both the risk of developing CD and the severity of CD after diagnosis. Other environmental factors that may play a role in CD are appendectomy, oral contraception, diet and domestic hygiene, but the evidence for each of these is much weaker than for smoking, and many contradictory studies exist (reviewed by Loftus1).

Familial studies confirm that first degree relatives have a substantially increased risk of CD compared with the general population.4–8 The lack of a consistent pattern of inheritance in families and the failure of linkage studies to identify strong linkage to any single region indicate that genetic susceptibility to CD is multifactorial, with many genes contributing a small effect towards underlying disease risk. Major progress on the identification of genes contributing to CD has been made recently. The first susceptibility gene to be identified was CARD15 (NOD2);9–11 three mutations increase risk of CD in a gene dosage model, with CD risk increasing with the number of mutations (0, 1, 2) carried.12 13 However, these mutations are also present in 12% of the general Western population, and therefore do not form a very specific test for CD risk. Two further genes have been identified through genome-wide association studies: IL23R14 and ATG16L1,15 and both findings have been independently replicated.16–18

In this paper, we present a model for estimating an individual’s risk of developing CD, incorporating genetic and environmental risk factors. The model combines CARD15 genotypes with smoking and family history to calculate CD risk. Although no single factor forms a test of high predictive value, risk factors in combination with each other may raise risks of CD substantially. This provides the potential to identify population subgroups that are at high risk of developing CD, with a view to intervening to prevent disease.19


Risk components

The probability of developing a multifactorial disease depends on the risk conferred by each genetic and environmental risk factor modelled, and the assumed interaction between these factors. We modelled risk of a disease, D, on three factors: an environmental variable, a genotype at a susceptibility gene and family history.

  1. For an environmental risk factor, E, assume that a proportion s of the population is exposed to E, increasing their risk of disease by odds ratio r compared with non-exposed members of the population (E′). For a rare disease, the odds ratio closely approximates the relative risk. So,

P(D|E) = r P(D|E′)


P(E) = s.

  2. Define a gene, G, with three genotypes (AA, AB, BB), where A is the wildtype allele with frequency p, and B is a mutation that increases risk of disease, with frequency q ( = 1−p). Assume an arbitrary model for disease risk, with genotype relative risks g0 ( = 1), g1 and g2, where

P(D|AB) = g1 P(D|AA),


P(D|BB) = g2 P(D|AA).

Let the disease penetrances be f0, f1 and f2, where f0 = P(D|AA), and similarly for f1 and f2, so that g2 = f2/f0 and g1 = f1/f0. Penetrances can be calculated directly from genotype counts in a case–control study using Bayes rule,

P(D|AA) = P(AA|D) P(D)/P(AA),

where P(D) is disease prevalence, and P(AA) is the population frequency of genotype AA. P(AA) is estimated from control genotype frequency if controls are a random population sample of people, or if the disease is rare. If controls are selected to be unaffected and the disease is common, then P(AA) is determined from the weighted case and control frequencies.
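The Bayes rule step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function name and the genotype frequencies are hypothetical, chosen only to show the arithmetic, and the controls are assumed to represent the population.

```python
def penetrance_from_case_control(case_freq, control_freq, prevalence):
    """Apply Bayes rule per genotype: P(D|g) = P(g|D) * P(D) / P(g).

    case_freq    -- genotype frequencies among cases, P(g|D)
    control_freq -- genotype frequencies among controls, used here as P(g)
    prevalence   -- disease prevalence, P(D)
    """
    return {g: case_freq[g] * prevalence / control_freq[g] for g in case_freq}


# Hypothetical genotype frequencies, for illustration only (not from the paper).
case_freq = {"AA": 0.70, "AB": 0.25, "BB": 0.05}
control_freq = {"AA": 0.86, "AB": 0.135, "BB": 0.005}

penetrances = penetrance_from_case_control(case_freq, control_freq, prevalence=0.001)
for genotype, f in penetrances.items():
    print(f"P(D|{genotype}) = {f:.5f}")
```

Averaging the resulting penetrances over the population genotype frequencies returns the prevalence, which is a useful sanity check on the inputs.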

Alternatively, the summary measures of risk presented in meta-analysis of genetic association studies may be used to estimate penetrances, enabling risk calculations to be based on more robust estimates than a single case–control study. Risks are usually presented as genotype relative risks (g1, g2) for genotypes AB and BB, relative to the wildtype AA genotype. Expressing the disease prevalence as K = p² f0 + 2pq f1 + q² f2 and solving for f0 gives f0 = K/(p² + 2pq g1 + q² g2), and hence penetrances in terms of the commonly available measures of allele frequency and genotype relative risk.
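As a rough sketch of this calculation, the Python function below solves the prevalence equation for f0 under Hardy–Weinberg genotype frequencies. The function name and the allele frequency q = 0.07 are assumptions made for illustration; the relative risks and prevalence are example inputs.

```python
def penetrances_from_grr(K, q, g1, g2):
    """Return (f0, f1, f2) given prevalence K, risk-allele frequency q,
    and genotype relative risks g1 (AB) and g2 (BB) relative to AA.

    Solves K = p^2 f0 + 2pq f1 + q^2 f2 with f1 = g1 f0 and f2 = g2 f0.
    """
    p = 1.0 - q
    f0 = K / (p ** 2 + 2 * p * q * g1 + q ** 2 * g2)
    return f0, g1 * f0, g2 * f0


# q is a hypothetical allele frequency; it is not a value taken from the paper.
K, q, g1, g2 = 0.001, 0.07, 2.25, 9.25
f0, f1, f2 = penetrances_from_grr(K, q, g1, g2)

# Sanity check: the penetrances average back to the prevalence.
p = 1.0 - q
recovered_K = p ** 2 * f0 + 2 * p * q * f1 + q ** 2 * f2
print(f0, f1, f2, recovered_K)
```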

  3. The final component of the model is the residual familial risk, after accounting for the contribution of gene G to the familial aggregation of disease. Familial risk can be assessed through the sibling relative risk (λS), which measures the increase in disease risk to the sibling of a case, compared with a random member of the population.20 For a gene, G, which has a multiplicative relationship with the remaining familial risks, λS can be partitioned into the relative risk due to this gene (λS,G), and to the residual risk, after accounting for the gene (λS,G’), by

λS = λS,G λS,G’

λS,G can be calculated from the disease risk due to G above,20 21 as

λS,G = 1 + (½ VA + ¼ VD)/K²,

where K is disease prevalence,

VA = 2pq (q (f2 − f1) + p (f1 − f0))²

is the additive genetic variance at G and

VD = p² q² (f2 − 2 f1 + f0)²

is the dominance genetic variance at G.

Similar expressions using genotype relative risks (1, g1, g2) in place of penetrances (f0, f1, f2) can be derived, and then no assumption of disease prevalence is needed in calculating λS,G. The sibling relative risk, λS, can also be extended to multiple genes, provided no epistasis exists between the genes modelled and between the genes and the residual family history. For estimating disease risk to parents or offspring, λP and λO are used, with

λP,G = λO,G = 1 + ½ VA/K².
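The variance-component formulas above translate directly into code. The sketch below is illustrative only: the function name is hypothetical, K is computed from the penetrances of the single gene (so the inputs are always mutually consistent), and the example penetrances and allele frequency are invented.

```python
def relative_risks_due_to_gene(q, f0, f1, f2):
    """Return (lambda_S_G, lambda_O_G) for a biallelic gene.

    q          -- frequency of the risk allele B (p = 1 - q for A)
    f0, f1, f2 -- penetrances of genotypes AA, AB, BB
    """
    p = 1.0 - q
    K = p ** 2 * f0 + 2 * p * q * f1 + q ** 2 * f2        # prevalence
    VA = 2 * p * q * (q * (f2 - f1) + p * (f1 - f0)) ** 2  # additive variance
    VD = p ** 2 * q ** 2 * (f2 - 2 * f1 + f0) ** 2         # dominance variance
    lam_sib = 1 + (0.5 * VA + 0.25 * VD) / K ** 2
    lam_parent_offspring = 1 + 0.5 * VA / K ** 2
    return lam_sib, lam_parent_offspring


# Hypothetical inputs, for illustration only.
lam_s, lam_o = relative_risks_due_to_gene(q=0.07, f0=0.0008, f1=0.0018, f2=0.0074)
print(lam_s, lam_o)
```

Two limiting cases help check an implementation: equal penetrances give relative risks of 1, and a strictly additive model (f1 midway between f0 and f2) has no dominance variance, so the sibling and parent–offspring values coincide.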

Disease risk calculation

To model the disease risk of an individual based on their genotype at G, exposure to the environmental factor E, and their family history, we assume independence between the environmental factor, genotype at gene G and residual family history (after accounting for G). For the environmental factor E,

P(D|E) = r P(D|E’)


P(D) = P(D|E) P(E)+P(D|E’) (1−P(E)),


P(D|E) = rK/(rs+(1-s)),


P(D|E’) = K/(rs+(1-s)).
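A short numerical sketch of the two conditional risks follows. The function name is hypothetical; the inputs (K = 0.001, r = 2, s = 0.24) are the prevalence, smoking odds ratio and smoking prevalence used elsewhere in this paper for CD.

```python
def risk_by_exposure(K, r, s):
    """Split prevalence K into risks for exposed and non-exposed people.

    K -- disease prevalence, P(D)
    r -- relative risk (odds ratio) conferred by exposure
    s -- proportion of the population exposed, P(E)
    Returns (P(D|E), P(D|E')).
    """
    denom = r * s + (1 - s)
    return r * K / denom, K / denom


exposed, unexposed = risk_by_exposure(K=0.001, r=2.0, s=0.24)
print(f"P(D|E)  = {exposed:.6f}")
print(f"P(D|E') = {unexposed:.6f}")
```

By construction, the exposure-weighted average of the two risks returns K, and their ratio is r.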

P(D|G = gi) are given above for gi = (AA, AB, BB), using genotype relative risks and disease prevalence.

For the sibling of a case, the increase in risk unaccounted for by gene G is λS,G’. Piecing together both genetic and environmental contributions to risk, assuming they act independently, gives the probability of disease in the sibling of a case as:

P(D|G, E, sib affected) = P(D|G)P(D|E) λS,G’/P(D) = 


where GRR(G) and OR(E) are the genotype relative risk and odds ratios of disease conferred by exposure or non-exposure to the environmental risk factor.

Application to CD


Smoking is a well-established risk factor for CD, which increases the risk and severity of CD. A series of epidemiological papers estimated the CD risk associated with smoking, matching patients with CD to controls by relevant factors such as age, sex and birth year, then calculating the odds ratio of CD in smokers compared with never-smokers.22–24 Many of these papers considered small series of cases and controls, and some used imprecise temporal data of current smoking status (especially in controls), rather than smoking at age of diagnosis of CD in the matched case. A meta-analysis of seven studies estimated a pooled odds ratio (OR) of 2.0 (95% CI 1.65 to 2.47) for CD in smokers compared with never-smokers.25 No consistent effect on CD risks was seen when data were analysed by sex, by smoking duration, or numbers of cigarettes smoked per day;25 however, cigarette smoking may specifically increase risk in late-onset CD patients (age of diagnosis >40 years).26

Several studies also calculated CD risks in former smokers. The meta-analysis showed an increased risk of CD compared with lifetime non-smokers, with an odds ratio of 1.80 (95% CI 1.33 to 2.51), only slightly lower than the risk for current smokers. Detailed assessment of the time since smoking cessation in the ex-smokers suggests that CD risks remain raised immediately after smoking cessation, and reduce after 2–4 years.22 24 A more recent case–control study of CD and both genetic and epidemiological risk factors estimated an OR for CD of 3.3 in smokers, and 1.7 in ex-smokers (95% CI 1.1 to 2.7).27 However, no study has systematically examined the risks of CD in long-term ex-smokers, and the potential role of smoking cessation in reducing risks of CD remains unclear. Smoking is currently the only potentially modifiable risk factor for CD and such information merits being communicated to smokers at risk of CD. Whether such communication is effective in achieving smoking cessation in this group remains to be seen.

Family history of CD

The familial risks of CD are estimated through ascertaining probands diagnosed with CD, and requesting information on which relatives are affected and unaffected with CD. Four Northern European studies were identified as having sufficient information from a large cohort of CD probands to estimate familial relative risks4–8 (table 1). CD twin studies were omitted because of the low numbers of CD probands.2 3 Other studies could not be used because CD relatives were classified only as first or second degree, without information on the precise relationship to the proband.28

Table 1 Summary of family history studies

Kuster et al4 performed a segregation analysis of 265 probands with CD from Dusseldorf, Germany, finding 13 siblings with CD (from a total of 453 siblings), and 4 parents with CD (from 530 parents). Satsangi et al8 studied 433 adult CD patients in Oxford, UK, showing that 33 siblings, 18 parents and 4 offspring were affected with inflammatory bowel disease. Of the 33 siblings, 20 were affected with CD, and assuming that the ratio of relatives affected with CD to those affected with ulcerative colitis remains constant across all first degree relatives, we estimate that 11.5 parents and 2.4 offspring were affected with CD. Peeters et al7 ascertained 640 probands with CD in Belgium, showing that 13.6% had a first degree relative with CD, and classifying the affected relatives into siblings (n = 57/1728), parents (n = 21/1280) or offspring (n = 17/835). Probert et al6 collected family history information on 424 patients with CD from Leicester, UK, finding 19 of 984 siblings affected with CD, 8 of 825 parents and 8 of 493 offspring.

Calculating the sibling relative risk requires a value for CD prevalence. This has been estimated in several UK studies, with rates per 100 000 population of 76 in Leicestershire6 and 147 in NE Scotland.29 The incidence of CD is increasing in all Western populations, so more recent estimates are higher, but these studies are from the same time period as the familial studies in table 1. In this study, we assumed 100 patients with CD per 100 000 of population, giving a disease prevalence of K = 0.001.

The sibling relative risk (λS) is defined as the ratio of the observed risk to siblings of patients with CD to the population prevalence (so λS = 1 provides a baseline of no increased familial risk). Study-specific sibling relative risks ranged from 19 (Probert et al6) to 33 (Peeters et al7), and weighting by the number of probands per study gives a mean sibling relative risk of 27.2 (table 1). Similar calculations from these papers show the estimated relative risk for parents to be λP = 12.8 (from four studies), and relative risk for offspring of probands to be λO = 16.6 (excluding the German study, which did not study the offspring of probands). The lower relative risks of parents and offspring may be due to the greater shared environment of siblings compared with offspring and parents. Alternatively, risks may reflect the lower prevalence of CD relevant for the parental cohort, and the young age of the offspring, who may not have lived through the full risk period for CD.
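The study-specific calculation can be sketched as follows, using only the sibling counts quoted in the text above. The Oxford study is left out because its sibling denominator is not given in the text, so the weighted mean printed here covers three studies only and is not expected to equal the four-study value of 27.2. Variable and function names are illustrative.

```python
K = 0.001  # assumed CD prevalence (100 per 100 000)

# study: (affected siblings, total siblings, number of probands), from the text
studies = {
    "Kuster": (13, 453, 265),
    "Peeters": (57, 1728, 640),
    "Probert": (19, 984, 424),
}

# Sibling relative risk = observed risk in siblings / population prevalence
lambda_s = {name: (aff / total) / K for name, (aff, total, _) in studies.items()}


def weighted_mean(values, weights):
    """Mean of values, weighted by the matching entries in weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


pooled = weighted_mean(
    [lambda_s[name] for name in studies],
    [probands for (_, _, probands) in studies.values()],
)

for name, value in lambda_s.items():
    print(f"{name}: lambda_S = {value:.1f}")
print(f"Proband-weighted mean over these three studies: {pooled:.1f}")
```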

CARD15 risks

CARD15 (NOD2) was localised in 2001, with the identification of three different mutations (R702W, G908R, L1007fs) conferring increased risk of CD.9–11 Other, much rarer mutations in CARD15 exist, with frequency <0.1%,30 31 but are found in only a small proportion of patients with CD. In common with other studies, these variants are not considered here.

Many research groups have published estimates of the population frequency and genotype relative risks of these CARD15 mutations, and a meta-analysis of 42 studies showed that people carrying one mutation had 2.3-fold increased odds of CD (95% CI 2.00 to 2.86), and people carrying two mutations had a 17.1-fold increased risk of CD (95% CI 10.7 to 27.2).13 Differences across populations exist, with no mutations present in Asian populations,32 and a trend across Europe, with higher frequencies of the 1007fs frameshift in central and southern Europe than in northern Europe.33 Differences in risk conferred by each mutation were detected in the meta-analysis and a large French family study,13 34 but few individual studies have sufficient power to detect this effect.

In this study, we used data from our large UK study of 1639 patients with CD and 1808 controls, genotyped for the 3 CARD15 mutations as described by Prescott et al.16 Individuals were categorised by the number of CD mutations carried (0, 1, 2), as mutations occur only rarely on the same haplotype, and without discriminating between the different mutations (table 2). Genotype relative risks are 2.25 (95% CI 1.88 to 2.68) for people carrying one mutation and 9.25 (95% CI 5.44 to 15.7) for people carrying two mutations, compared with those carrying no CARD15 mutations. Using these figures in the equations for the sibling relative risk due to CARD15 above gives λS,CARD15 = 1.16.

Table 2 CARD15 genotype counts for UK patients with CD and controls


The estimated risks of developing CD, based on the model developed above, are shown in tables 3 and 4. The parameter estimates for the risk calculation are summarised in table 3; in addition to the values derived and discussed above, we assumed that the smoking prevalence in the UK is 24%.35 The CD risks for the sibling, parent or offspring of a patient with CD, based on CARD15 mutation status (0, 1 or 2 mutations, or ungenotyped) and smoking status (smoker, non-smoker or unknown), are given in table 4. The baseline risk for the sibling of a patient with CD follows from the sibling relative risk of λS = 27.2, which gives a 2.7% risk of developing CD (assuming a disease prevalence of 0.001). A sibling who carries no mutations in CARD15 has a slightly reduced risk of 2.3%, a sibling with a single CARD15 mutation has an increased risk of 5.3%, and a sibling with two mutations has a greatly increased risk of 21.7%. Siblings who also smoke increase their risks still further to 3.8% (0 mutations), 8.5% (1 mutation), or 35.0% (2 mutations). Non-smokers have risks equal to approximately half the risks for siblings who smoke. Without information on CARD15, the risks for non-smoking and smoking siblings of patients with CD are 2.2% and 4.4%, respectively. Adding CARD15 mutation status therefore provides considerable resolution of these risks, and permits the identification of a cohort of people (smokers who carry two CARD15 mutations and have a sibling with CD) who have a high probability of developing CD.
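The sibling risks quoted above can be approximated with a short script. This is a sketch under stated assumptions rather than the authors' program: it multiplies the prevalence by the genotype relative risk, the smoking factor and the residual sibling relative risk (λS divided by λS,CARD15), which amounts to treating the risk for people with no CARD15 mutation as approximately K. That simplification is an assumption made here because it agrees with the rounded figures in the text. The function and constant names are illustrative.

```python
K = 0.001                  # assumed CD prevalence
LAMBDA_S = 27.2            # sibling relative risk
LAMBDA_S_CARD15 = 1.16     # sibling relative risk attributable to CARD15
R_SMOKING = 2.0            # odds ratio for smokers vs never-smokers
S_SMOKING = 0.24           # smoking prevalence
GRR = {0: 1.0, 1: 2.25, 2: 9.25}  # genotype relative risks by mutation count


def sibling_risk(mutations=None, smoker=None):
    """Approximate CD risk for the sibling of a patient with CD.

    mutations -- 0, 1 or 2 CARD15 mutations, or None if ungenotyped
    smoker    -- True, False, or None if smoking status is unknown
    """
    if mutations is None:
        genetic = LAMBDA_S  # full familial risk, CARD15 not separated out
    else:
        genetic = GRR[mutations] * LAMBDA_S / LAMBDA_S_CARD15
    if smoker is None:
        environment = 1.0
    else:
        environment = (R_SMOKING if smoker else 1.0) / (
            R_SMOKING * S_SMOKING + (1 - S_SMOKING)
        )
    return K * genetic * environment


for m in (None, 0, 1, 2):
    for smk in (None, False, True):
        print(m, smk, f"{100 * sibling_risk(m, smk):.1f}%")
```

Because the factors simply multiply, the same function shape would extend to further independent risk factors, provided the multiplicative assumption holds for them.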

Table 3 Parameter estimates used in CD risk calculation model
Table 4 Absolute risks of developing CD

For siblings of a patient with CD, being a non-smoker with no CARD15 mutations decreases risk by approximately one-third, whereas being a smoker with two CARD15 mutations increases risk 13-fold, compared with the risk of 2.7% based solely on family history information. A non-smoker with a single CARD15 mutation is at approximately the same risk of developing CD as a smoker who carries no CARD15 mutations (as the ORs for smoking and for one CARD15 mutation are 2 and 2.24 respectively). The major contributions to disease risk are carrying two CARD15 mutations (OR = 9.24) and the residual family history risk after accounting for CARD15 (λS,CARD15’ = 23). Similar patterns of risk are seen for the parents and offspring of patients with CD. A program for risk calculation is available on request.


We have developed a model to calculate individual-specific risks for complex genetic diseases, based on our current knowledge of environmental and genetic risk factors. This model shows how inclusion of genotype information from only a single gene and a single environmental risk factor can substantially change disease risks. For an individual with all risk factors present, a moderately high disease risk was obtained (35% for smoking siblings of a patient with CD, with two CARD15 mutations).

Few complex genetic diseases have sufficient information on both genetic and environmental factors to estimate disease risks. In age-related macular degeneration, Maller et al36 estimated risks of up to 50% from single-nucleotide polymorphism genotypes at three genes. Smoking is also a major risk factor for this condition and, in contrast to CD, an interaction with the LOC387715 locus exists, with risks for smokers who also carry the high-risk genotypes being much higher than predicted by the marginal risks conferred by smoking and LOC387715.37 A population study of CD in Manitoba27 showed that people with a family history of CD who also smoked and carried two CARD15 mutations had a much increased risk of CD, with an OR of 257 (95% CI 63 to 1054). The methodology of the Manitoba study was different from the current study, with risk information assessed from a series of healthy controls (n = 336) and patients with CD (n = 232), giving an OR of 13.9 for people carrying two CARD15 mutations, 3.0 for smokers, and 6.2 for the presence of CD in a first degree relative. Despite the disparity of the designs and the specific risk estimates calculated, both this study and the Manitoba study illustrate how combining risks across several factors of modest effect can produce high predicted disease risks for specific subgroups of the population.

One limitation of our model is that it uses no information on the age of the affected and unaffected siblings, or the time since diagnosis of the patient with CD. Siblings are correlated in age at diagnosis, with most affected siblings diagnosed within 10 years of each other.7 38 Therefore a sibling who remains unaffected with CD more than a decade after the diagnosis of their sibling has lived through the majority of their at-risk period, and must be considered at lower risk. Joint distributions of age at diagnosis of CD in family members could be used to develop further methodology. For example, age-dependent methods for risk estimation were developed in the REVEAL study of Alzheimer’s disease, incorporating APOE genotypes and family history and age, using Kaplan–Meier survival methods.39 40 Our risk model assumes that risk factors are independent, with multiplicative relationship across risks conferred. In CD, a joint study on CARD15 and variants in IL23R and on chromosome 5q31 showed that risks across genes were multiplicative,16 and smoking prevalence is similar in CD patients with and without CARD15 mutations, implying the lack of any interaction between these risk factors.27 30 Current studies therefore support the assumption of independent contribution to risk from each genetic and environmental source. The model can be expanded to multiple genes (eg IL23R and ATG16L1) and additional environmental risk factors, provided the assumption of multiplicative risks across factors holds.

The method for risk estimation developed here is flexible in allowing risk factors estimated from independent studies to be combined. This is a valuable property for CD, as few studies with good information on family history and environmental and genetic risk factors exist. Other methods for risk estimation require more extensive data. For example, the BOADICEA model was developed for familial breast cancer using segregation analysis on extended families, and incorporating information from a population series of breast cancer cases.41 Extended pedigrees for CD are rarely available, and genetic studies have focused on collections of affected people (affected sibling pairs, case series), so this tool could not be used here. For studies in which genetic and environmental risk factor information is available for all participants, regression models may be used to estimate risks, and these risks can then be validated in an independent dataset. Again, suitable datasets for CD are not currently available, and no validation for the current model has yet been performed.

The utility of screening unaffected relatives for mutations in CARD15 to assess their risk of developing CD has been much debated.42–44 Such screening is considered to be of limited value because of the low predictive value of the test (14% of the UK population carries at least one mutation in CARD15). However, the utility of screening smokers at risk who have the potential to reduce their risk is unknown. The risk estimates published here are being used to provide smokers who are first degree relatives of patients with CD with estimates of the likelihood that they will develop CD, together with information about the potentially risk-reducing effects of smoking cessation. Of interest is whether providing precise disease risk estimates based on DNA analysis together with information about reducing these risks motivates smoking cessation, a question we are investigating in a randomised controlled trial.


We thank Natalie Prescott and Sheila Fisher for data on CARD15 genotypes. We acknowledge use of DNA from the British 1958 Birth Cohort collection, funded by the Medical Research Council (grant G0000934) and The Wellcome Trust (grant 068545/Z/02).




  • Funding: This research was supported by the Wellcome Trust (076024 to CML; 072029 to CGM) and the Medical Research Council (G0500274 to TMM for Risk communication in preventative medicine: optimising the impact of DNA risk information). Guy’s and St Thomas’ NHS Trust in conjunction with KCL, and University College Hospital in conjunction with UCL, receive funding support from the UK National Institute for Health Research.

  • Competing interests: None declared.
