Enhancing the BOADICEA cancer risk prediction model to incorporate new data on RAD51C, RAD51D, BARD1 updates to tumour pathology and cancer incidence

Background BOADICEA (Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm) for breast cancer and the epithelial tubo-ovarian cancer (EOC) models included in the CanRisk tool (www.canrisk.org) provide future cancer risks based on pathogenic variants in cancer-susceptibility genes, polygenic risk scores, breast density, questionnaire-based risk factors and family history. Here, we extend the models to include the effects of pathogenic variants in recently established breast cancer and EOC susceptibility genes, up-to-date age-specific pathology distributions and continuous risk factors.

Methods BOADICEA was extended to further incorporate the associations of pathogenic variants in BARD1, RAD51C and RAD51D with breast cancer risk. The EOC model was extended to include the association of PALB2 pathogenic variants with EOC risk. Age-specific distributions of oestrogen-receptor-negative and triple-negative breast cancer status for pathogenic variant carriers in these genes and in CHEK2 and ATM were also incorporated. A novel method to include continuous risk factors was developed, exemplified by including adult height as a continuous variable.

Results BARD1, RAD51C and RAD51D explain 0.31% of the breast cancer polygenic variance. When incorporated into the multifactorial model, 34%–44% of these carriers would be reclassified to the near-population and 15%–22% to the high-risk categories based on the UK National Institute for Health and Care Excellence guidelines. Under the EOC multifactorial model, 62%, 35% and 3% of PALB2 carriers have lifetime EOC risks of <5%, 5%–10% and >10%, respectively. Including height as a continuous variable increased the breast cancer relative risk variance from 0.002 to 0.010.

Conclusions These extensions will allow for better personalised risks for BARD1, RAD51C, RAD51D and PALB2 pathogenic variant carriers and more informed choices on screening, prevention, risk factor modification or other risk-reducing options.

= 1 for noncarriers of any PV and 0 otherwise. The cancer incidences associated with homozygous and heterozygous carriers of PVs in each gene are assumed to be the same, and the risk to carriers of PVs in more than one gene is assumed to be that of the higher-ranked PV in the dominance order. Because PVs are rare, this model can be well approximated by assuming a single locus with $n_{MG} + 1$ alleles, one representing the presence of a PV in each of the $n_{MG}$ genes and an additional wild-type allele representing absence of PVs in all genes [1]. The $\beta_{MG\mu}(t)$ represent the age-specific log-relative risks (log-RRs) associated with the major genes relative to the baseline incidence. The relative risks (RRs) assumed for the major genes are summarised in Table 1. $x_P^{(i)}$ is the polygenotype for individual $i$, assumed normally distributed in the general population with mean 0 and standard deviation 1, and $\beta_{PG}(t)$ is the age-specific log-RR per standard deviation associated with the polygene, relative to the baseline incidence [2, 3]. When a PRS is known, the polygenotype is decomposed into an observed and a residual component, where the observed component is given by the PRS [4]. $\rho$ indexes the RFs that are present in the model, which are modelled as categorical factors. $\boldsymbol{\beta}_{RF\rho\mu}(t)$ is the vector (of length $n_\rho - 1$, where $n_\rho$ is the number of categories for RF $\rho$, with one category being the baseline) of age-specific log-RRs associated with RF $\rho$, which may depend on the major genotype $\mu$, and $\mathbf{z}_{RF\rho}^{(i)}$ is the corresponding vector of indicator variables (0 or 1) that indicate the category of RF $\rho$ for individual $i$ (1 for the observed category, 0 otherwise, with all elements 0 for the baseline). The baseline incidences $\lambda_0(t)$ are determined so that the total age-specific incidences, summed over the RFs and genotypes, agree with the population incidence (given the assumed population distributions and RRs) [2, 5].
The population incidences are birth-cohort and country-specific, but this dependence is omitted from equation (s.1) for clarity of notation. The RRs and distributions of the RFs have been described elsewhere [4, 6]. To allow appropriately for missing RF information, only those RFs measured on a given individual are considered (thus, the baseline incidences $\lambda_0(t)$ are determined for each individual depending on their measured RFs).
The models assume that RRs associated with PVs in the major genes are log-additive (multiplicative) with the RFs and the polygenic component. The model also assumes that the PVs and the PRS combine multiplicatively (conditional on other factors).
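The log-additive (multiplicative) structure described above can be illustrated with a minimal sketch. The function name, argument names and all numerical values below are hypothetical and are not taken from the BOADICEA implementation; the sketch only shows how a baseline incidence is scaled by the exponentiated sum of the major-gene, polygenic and risk-factor log-RRs at a single age.

```python
import math

def individual_incidence(baseline, log_rr_gene, log_rr_per_sd, polygenotype,
                         rf_log_rrs=(), rf_indicators=()):
    """Incidence at one age: baseline * exp(sum of log-RR contributions).

    rf_log_rrs    : one list of log-RRs per risk factor (non-baseline categories)
    rf_indicators : matching 0/1 indicator lists; all zeros = baseline category
    """
    linear_predictor = log_rr_gene + log_rr_per_sd * polygenotype
    for betas, z in zip(rf_log_rrs, rf_indicators):
        linear_predictor += sum(b * zi for b, zi in zip(betas, z))
    return baseline * math.exp(linear_predictor)

# Hypothetical inputs, chosen only to show that the effects multiply.
baseline = 0.002  # assumed baseline incidence per year at this age
carrier = individual_incidence(
    baseline,
    log_rr_gene=math.log(2.0),     # assumed major-gene RR of 2
    log_rr_per_sd=0.5,             # assumed polygenic log-RR per SD
    polygenotype=1.0,              # one SD above the population mean
    rf_log_rrs=[[math.log(1.2), math.log(1.5)]],
    rf_indicators=[[0, 1]],        # second non-baseline category observed
)
print(carrier / baseline)          # combined relative risk vs baseline
```

Because every term enters the exponent additively, the combined relative risk is the product of the individual relative risks, which is the multiplicative assumption stated in the text.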
The models evaluate pedigree likelihoods using the MENDEL software [7]. As MENDEL considers only finite discrete genotypes, the polygenotype is approximated by the hypergeometric polygenic model [1, 5, 8].
Both models consider family history of breast cancer (BC), EOC, pancreatic cancer (PaC) and prostate cancer (PrC). The incidences of each cancer are assumed independent, conditional on the genotypes and RFs in the model. In BOADICEA, EOC, PaC and PrC are assumed to depend only on the major genotype. Correspondingly, in the EOC model, BC, PaC and PrC are assumed to depend only on the major genotype.

Adjusting the residual polygenic component after the inclusion of new major genes
The variance due to PVs in each gene at age $t$ is given by: where $q_\mu$ is the population allele frequency of gene $\mu$; the variance components are assumed to be additive. This process also considered the updated RR and PV frequencies for the previously included genes. For BOADICEA, the overall BC polygenic variance was $4.83 - 0.5961 \times t$ for females and 1.4 for males, while for the EOC model, the overall EOC polygenic variance was 1.434 [2, 3].
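As a rough illustration of the adjustment, the sketch below subtracts the variance attributed to major genes from an overall polygenic variance, relying on the additivity assumption stated above. The per-gene variance formula used here is an assumption made for illustration (the variance of a log-RR applied to a binary carrier indicator, with carrier probability 1 − (1 − q)²) and may differ from the expression used in the models; the allele frequencies and RRs are hypothetical. Only the overall EOC variance of 1.434 comes from the text.

```python
import math

def carrier_variance(allele_freq, rr):
    """ASSUMED form: variance of log(RR) * carrier indicator for one gene."""
    carrier_prob = 1.0 - (1.0 - allele_freq) ** 2
    return carrier_prob * (1.0 - carrier_prob) * math.log(rr) ** 2

def residual_polygenic_variance(total_variance, genes):
    """Total variance minus the (additive) variance explained by each gene.

    genes: iterable of (allele_frequency, relative_risk) pairs.
    """
    explained = sum(carrier_variance(q, rr) for q, rr in genes)
    residual = total_variance - explained
    if residual < 0:
        raise ValueError("major genes explain more than the total variance")
    return residual

# Hypothetical genes: (allele frequency, RR at the age of interest).
hypothetical_genes = [(0.0005, 5.0), (0.0003, 3.0)]
print(residual_polygenic_variance(1.434, hypothetical_genes))
```

In an age-dependent setting the same subtraction would be repeated at each age, since both the overall variance and the gene-specific log-RRs can vary with age.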

Allele Frequencies
Allele frequencies for all genes, except BRCA1 and BRCA2, were taken from the BRIDGES study [9]. The frequencies were based on the frequency of protein-truncating variants in European ancestry controls. To account for the incomplete sensitivity of the sequencing as performed in BRIDGES, the frequencies were adjusted by dividing by $X s (1 - G)$, where $X$ is the proportion of the coding sequence of each gene determined to be callable, $s$ is the proportion of variants in the called sequence across all genes that were detected (estimated to be 0.957), and $G$ is the proportion of the pathogenic variants expected to be copy number variants. For CHEK2, the adjustment was applied to variants excluding c.1100delC. Details are given in the Supplementary Material of Dorling et al. [9]. For BRIP1, $G$ was assumed to be 0.05. The BRCA1 and BRCA2 frequencies from the previous versions of BOADICEA and the EOC model were used for consistency.
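The adjustment is simple arithmetic and can be sketched as follows. The function and parameter names are illustrative, and the observed frequency and callable proportion in the example are hypothetical; only the detection proportion of 0.957 and the BRIP1 value of 0.05 are taken from the text.

```python
def adjust_allele_frequency(observed_freq, callable_fraction,
                            detection_rate=0.957, cnv_fraction=0.0):
    """Divide the observed frequency by callable * detected * (1 - CNV share)."""
    for name, value in (("callable_fraction", callable_fraction),
                        ("detection_rate", detection_rate)):
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1]")
    if not 0.0 <= cnv_fraction < 1.0:
        raise ValueError("cnv_fraction must be in [0, 1)")
    return observed_freq / (callable_fraction * detection_rate * (1.0 - cnv_fraction))

# Hypothetical observed frequency and callable fraction for a BRIP1-like gene.
adjusted = adjust_allele_frequency(0.0010, callable_fraction=0.95, cnv_fraction=0.05)
print(adjusted)  # larger than the observed 0.0010, as every divisor is below 1
```

Each factor is at most 1, so the adjusted frequency is never smaller than the observed one.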

Sensitivities
The default sensitivities are based on the assumption that protein-truncating variants and known pathogenic missense variants are detected with close to 100% sensitivity in clinical tests but that, except for BRCA1 and BRCA2, large rearrangements are not detected. The sensitivities are therefore given by 1 − G, as above. For BRCA1 and BRCA2, sensitivities were defined by assuming that the main source of insensitivity was missense variants not classified as pathogenic; the frequencies of these variants have been estimated by Dorling et al. [10].
Were large re-arrangements not tested for, the corresponding sensitivities for BRCA1 and BRCA2 would be reduced to ~76% and 95% respectively (owing to the much higher frequency of large re-arrangements in BRCA1).

BRCA2: ovarian cancer relative risks updates
Previous estimates of the EOC relative risks for BRCA2 PV carriers were obtained during the BOADICEA model fitting process, using complex segregation analysis in families with BRCA2 PVs [2]. This involved fitting models in which the log-relative risks were piecewise linear functions of age. Due to the very small number of EOCs diagnosed at ages 65 years and over in the original dataset, the RR was estimated to decrease rapidly from 23.7 at age 58 to 1.59 at age 69 and to remain constant at that level thereafter. However, more recent data suggest that the EOC RRs for ages 70 and over are higher [11]. The original RR estimate of 1.59 may therefore result in an underestimation of risks for older BRCA2 carriers. We updated the log-RR function included in the model by re-deriving the piecewise linear log-RR function such that the EOC RR decreases less rapidly, from 23.7 at age 58 to 4.4 for ages 70 and over. The RR = 4.4 estimate used for ages 70 and over was obtained from a prospective cohort analysis of BRCA2 PV carriers [11].
The updated log-RR EOC parameters for ages 58 and over for BRCA2 carriers are shown in Table 1 and the resulting age-specific EOC cumulative risks are shown in Figure s3.
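For illustration, the updated segment can be sketched as a function that is linear on the log-RR scale between the two anchor points given in the text (RR 23.7 at age 58 and RR 4.4 at age 70) and constant thereafter. Treating this as a single linear segment with no intermediate knots is an assumption; the parameters actually used are those in Table 1. The function name is hypothetical.

```python
import math

def brca2_eoc_rr(age, age_start=58.0, rr_start=23.7, age_end=70.0, rr_end=4.4):
    """EOC relative risk for BRCA2 carriers at ages >= 58 (illustrative only).

    Assumes one linear segment in log-RR between the two anchor ages.
    """
    if age < age_start:
        raise ValueError("this sketch only covers ages 58 and over")
    if age >= age_end:
        return rr_end
    fraction = (age - age_start) / (age_end - age_start)
    log_rr = math.log(rr_start) + fraction * (math.log(rr_end) - math.log(rr_start))
    return math.exp(log_rr)

for age in (58, 62, 66, 70, 80):
    print(age, round(brca2_eoc_rr(age), 2))
```

Under this assumption the RR at the midpoint (age 64) is the geometric mean of 23.7 and 4.4, which reflects linearity on the log scale rather than on the RR scale.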

Population Incidences
The BOADICEA and EOC models both allow population customisation via population-specific incidences [4] (Figure s1).
Incidences for some of the existing regions were updated using data from more recent calendar years. The models use calendar-specific population incidences to calculate cohort-specific incidences [2], where the cohorts are defined by decadal birth year ranges (1910-1919, 1920-1929, 1930-1939, 1940-1949, 1950-1959, 1960-1969, 1970-1979 and 1980-1989, with individuals born before/after the first/last cohort assumed to have the same incidences as the first/last cohort). The original model used UK incidences from CI5, which reported calendar incidences averaged in 5-year calendar-period bins [2, 15]. Cohort incidences were then taken as those for someone born in the middle year of each range to represent that cohort (1915 for 1910-1919, etc.). However, some of the other regions have smaller populations and report annual calendar-period-specific incidences. For these populations, especially for cancers with low incidences (e.g., EOC and male BC), using a single year to represent the cohort can lead to cohort incidences dominated by year-on-year calendar fluctuations. The methodology was refined by deriving new sets of cohort incidences. In these, the age-specific incidences for an individual in the cohort were taken as the average of the age-specific incidences applicable to those born in each year of the birth-cohort range. The average age- and cohort-specific incidences were then smoothed using LOWESS with linear regression and a bandwidth of 0.2. Figure s2 (b) shows the effects of the new averaging method on cohort incidences for Estonian male breast cancer incidences for those born in the 1920s.
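The difference between the single-year and the averaged cohort incidences can be sketched as below. The data layout (a dictionary keyed by calendar year, then age), the function name and the numbers are hypothetical. For simplicity, calendar years outside the available range are clamped to the nearest available year, and the subsequent LOWESS smoothing step is omitted.

```python
def cohort_incidence(calendar_incidence, birth_years, age):
    """Average, over birth years in the cohort, of the incidence at a given age.

    calendar_incidence: {calendar_year: {age: rate}}
    A person born in year b reaches `age` in calendar year b + age.
    """
    first_year, last_year = min(calendar_incidence), max(calendar_incidence)
    rates = []
    for birth_year in birth_years:
        year = min(max(birth_year + age, first_year), last_year)  # clamp to data
        rates.append(calendar_incidence[year][age])
    return sum(rates) / len(rates)

# Hypothetical annual rates at age 60 with large year-on-year fluctuations.
noisy = [0.0, 4.0, 1.0, 5.0, 0.0, 9.0, 1.0, 3.0, 0.0, 2.0]
table = {1980 + k: {60: rate} for k, rate in enumerate(noisy)}

single_year = cohort_incidence(table, [1925], 60)           # midpoint birth year only
averaged = cohort_incidence(table, range(1920, 1930), 60)   # all ten birth years
print(single_year, averaged)
```

With noisy annual data, the midpoint birth year can land on an outlying calendar year, whereas the average over the ten birth years in the cohort is much less sensitive to any single year.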
In addition, incidences for years before/after the earliest/latest calendar year were previously taken to be the same as those in the earliest/latest calendar year available. Again, for regions with small populations reporting annual calendar-period incidences and for cancers with low incidences, the cohort incidences can be adversely affected by statistical anomalies present in the incidences of the earliest/latest calendar year. The methodology was refined so that the incidence for years before/after the earliest/latest calendar year is taken as the average of the first/last five years of the available annual calendar-period incidences.
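A minimal sketch of this extrapolation rule is given below, assuming annual incidences for a single age group stored in a dictionary keyed by calendar year. The function name and the example rates are hypothetical.

```python
def incidence_for_year(annual, year, window=5):
    """Return the incidence for `year`, extrapolating beyond the data range.

    annual: {calendar_year: rate} for one age group.
    Years before the first (after the last) available year use the mean of
    the first (last) `window` available years instead of a single edge year.
    """
    years = sorted(annual)
    if year < years[0]:
        edge_years = years[:window]
    elif year > years[-1]:
        edge_years = years[-window:]
    else:
        return annual[year]
    return sum(annual[y] for y in edge_years) / len(edge_years)

# Hypothetical rates for 2000-2009 with an anomalous final year.
annual = {2000 + k: rate for k, rate in
          enumerate([3.0, 2.0, 4.0, 3.0, 3.0, 2.0, 3.0, 4.0, 3.0, 8.0])}
print(incidence_for_year(annual, 1995))  # mean of 2000-2004
print(incidence_for_year(annual, 2015))  # mean of 2005-2009
```

Here the anomalous value in the final year contributes only one fifth of the extrapolated rate, rather than determining it entirely.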

Algorithm optimisation
The BOADICEA future risk calculations rely on calculating pedigree likelihoods under the assumed genetic models of inheritance [2]. The inclusion of additional genes (RAD51C, RAD51D, BARD1) in the model resulted in a substantial increase in runtime. This is further compounded by the fact that separate pedigree likelihood calculations are required for risk predictions at multiple future time-points when using the CanRisk tool (e.g. at annual or 5-year intervals). To reduce the programme runtime when using CanRisk, we re-formulated the underlying algorithm to calculate the future risks as follows.
BOADICEA calculates the probability that an individual develops breast (or ovarian) cancer over a given time period, given the age of the proband, the genotypes, other risk factors, and family history: where $y(t)$ is the phenotype of the proband at time $t$, $y_R$ represents the phenotypes of all the relatives, $z$ are the risk factors measured on the proband and $\theta$ are the genetic model parameters (allele frequencies, relative risks, etc.). $t_0$ is the current age of the proband and $t_f$ the future age at which the predictions are being made. In practice, these are calculated as the ratio of two pedigree likelihoods [2]: The numerator and denominator probabilities are the probabilities of the full set of phenotypes in the pedigree at times $t_f$ and $t_0$ and are calculated in MENDEL, using a pedigree peeling algorithm [7]. When predicting future risks, this involves performing this calculation repeatedly at several time-points. However, under the standard assumption in pedigree likelihood calculations, the phenotype of the proband is conditionally independent of those of relatives given the genotypes of the relatives [28]. Thus: which can be re-written as: where $G$ is the full set of genotypes (including the full measured and unmeasured polygenic or major gene components) and Therefore, the risk prediction (expression (s.2)) can be performed by first calculating the genotype probabilities for the proband given the phenotypes at time $t_0$ (i.e. a single, time-consuming pedigree likelihood calculation) and then calculating the penetrance function for the proband at multiple time-points, $P(y(t_f) \mid G, z, \theta)$, which does not involve any pedigree likelihood calculations.

BMJ Publishing Group Limited (BMJ) disclaims all liability and responsibility arising from any reliance placed on this supplemental material which has been supplied by the author(s).
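The computational idea behind the reformulation can be sketched as follows. This is a simplified illustration rather than the CanRisk implementation: it assumes the proband's genotype probabilities given all phenotypes at the current age are already available (in practice these would come from the single pedigree likelihood calculation), uses a constant relative risk per genotype, and ignores competing causes of death. All names and numbers are hypothetical.

```python
import math

def survival(annual_incidence, rr, start_age, end_age):
    """P(remaining unaffected from start_age to end_age) for one genotype."""
    cumulative = sum(annual_incidence[a] * rr for a in range(start_age, end_age))
    return math.exp(-cumulative)

def future_risks(posterior, relative_risk, annual_incidence, current_age, future_ages):
    """Risk by each future age, reusing one set of genotype probabilities.

    posterior     : {genotype: P(genotype | all phenotypes at current age)}
    relative_risk : {genotype: RR applied to the annual incidence}
    Only the cheap per-genotype penetrance term is recomputed per time-point.
    """
    risks = {}
    for age in future_ages:
        risks[age] = sum(
            prob * (1.0 - survival(annual_incidence, relative_risk[g], current_age, age))
            for g, prob in posterior.items()
        )
    return risks

# Hypothetical inputs: two genotype classes and a flat annual incidence.
posterior = {"noncarrier": 0.9, "carrier": 0.1}
relative_risk = {"noncarrier": 1.0, "carrier": 5.0}
annual_incidence = {age: 0.002 for age in range(20, 81)}
print(future_risks(posterior, relative_risk, annual_incidence, 40, [45, 50, 60, 80]))
```

The expensive step (obtaining `posterior`) happens once, and adding further prediction ages only adds inexpensive sums over genotypes, which is consistent with the runtime reduction described in the text.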
The risk calculations under the revised and original formulations are identical, but when calculating the remaining lifetime cancer risks used in the CanRisk tool (www.canrisk.org), there is a 50-90% reduction in computation time under this revised formulation, depending on the proband's age (Figure s4).

Table s2. Predicted ovarian cancer risk to age 50 (age 20 to 50 years) and lifetime risk (age 20 to 80 years) for a female born in 1985 with unknown family history and for a female with a mother affected at age 50. The columns labelled "Risk" contain risks in the absence of information about risk factors (RF) or a polygenic risk score (PRS). The other columns show the distribution of females based on these risk factors falling into risk categories defined as: 1) near-population risk, shaded pink (<5% lifetime risk; <3% risk to age 50), 2) moderate risk, shaded yellow (≥5% and <10% lifetime risk; ≥3% and <5% risk to age 50) and 3) high risk, shaded blue (≥10% lifetime risk; ≥5% risk to age 50). Column headings are shaded the same colours as the corresponding lines in Figure 1