Introduction

In women, breast cancer is the most commonly diagnosed cancer and the leading cause of cancer-related deaths1. Approximately 5–10% of all breast cancers are estimated to develop due to high-impact germline mutations in breast cancer susceptibility genes, with up to 30% due to pathogenic mutations in BRCA1 and BRCA2 and with a smaller proportion carrying mutations in other susceptibility genes, such as PTEN, TP53, CHEK2, PALB2 and STK112. While pathogenic mutations in BRCA1 and BRCA2 are less common in Finns3, two frameshift mutations, c.1592delT (rs180177102) in PALB2 and c.1100delC (rs555607708) in CHEK2 have an unusually high allele frequency in Finland, which provides a unique opportunity to explore the impact of these mutations in the population. PALB2 (Partner and Localizer of BRCA2) encodes a key tumour suppressor protein that functions through affecting BRCA2 nuclear localisation and DNA damage response functions, and through interacting with BRCA14. The second gene, CHEK2 (Checkpoint kinase 2), is a tumour suppressor gene encoding a serine/threonine-protein kinase involved in DNA repair, cell cycle arrest and apoptosis5.

Beyond genetic predisposition caused by high-risk mutations in breast cancer susceptibility genes, breast cancer has a highly polygenic mode of inheritance. Large-scale genetic screens have to date identified over a hundred loci associated with risk of breast cancer6. These variants, and many more yet to be discovered, represent common genetic variation acting through a wide range of molecular pathways, in contrast to the rare, high-risk pathogenic variants in high-risk breast cancer susceptibility genes that often disrupt a specific pathway involved in maintaining integrity of DNA repair processes. Individually, the common variants have very small effect sizes with odds ratios usually ranging from 0.85 to 1.20, but their cumulative impact in breast cancer risk has been shown to be considerably larger7. This cumulative effect can be captured in a single measure by a polygenic risk score (PRS), the summed contribution of many common risk variants, which is able to identify women at over 3-fold risk of breast cancer, compared to women with an average risk7,8. By improving identification of these women at high risk of breast cancer, it could serve as a new tool for personalised, risk-based breast cancer screening7,9,10.
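As a minimal illustration of the "summed contribution" idea described above, the following sketch computes a PRS as the sum of per-variant weights (log odds ratios) multiplied by the number of effect alleles a woman carries. The variant names, odds ratios and dosages are invented for illustration and do not come from any published score.

```python
import math

def polygenic_risk_score(dosages, weights):
    """Sum of weight * effect-allele dosage over variants present in both dicts."""
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)

# Hypothetical per-variant odds ratios, inside the 0.85-1.20 range quoted above.
odds_ratios = {"var_a": 1.20, "var_b": 0.85, "var_c": 1.05}
# Weights are the natural logarithms of the odds ratios.
weights = {v: math.log(o) for v, o in odds_ratios.items()}

# Hypothetical dosages: 0, 1 or 2 copies of the effect allele per variant.
woman = {"var_a": 2, "var_b": 0, "var_c": 1}
score = polygenic_risk_score(woman, weights)
print(round(score, 4))
```

Genome-wide scores follow the same arithmetic, only with far more variants and with weights adjusted for linkage disequilibrium.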

Here, we comprehensively assess the impact of germline genetic variation on risk of breast cancer and show (1) how a high breast cancer PRS compares to high-risk mutations in breast cancer susceptibility genes, (2) how the PRS modifies the risk of breast cancer in women carrying pathogenic mutations in the PALB2 and CHEK2 genes and (3) that the PRS has utility for informing about risk of contralateral breast cancer, and about the risk in first-degree relatives. We use data from the FinnGen study, which combines nationwide health registries with genomic information for 122,978 women from across the country, representing 5% of the Finnish adult female population.

Results

We studied 122,978 women in FinnGen, with a mean age at the end of follow-up of 58.5 years (interquartile range, IQR 45.1–72.2; range 16.0–106.0). In FinnGen, 8401 (6.8%) women have been diagnosed with breast cancer, with a mean age at disease onset of 58.6 years (IQR 50.4–66.3; range 21.3–98.3). We first tested the association of three polygenic risk scores with breast cancer risk: a 313 SNP score7, a genome-wide score by Mars et al.10 derived using LDpred software and a new genome-wide score derived using PRS-CS11. In our data, the genome-wide scores outperformed the 313 SNP score: with each score scaled separately to mean zero and unit variance, the hazard ratio (HR) per standard deviation was 1.55 (95% confidence interval, CI 1.52–1.58) for the 313 SNP score, 1.63 (CI 1.60–1.67) for the LDpred score and 1.71 (CI 1.68–1.75) for the PRS-CS score. We therefore chose the PRS built with PRS-CS for subsequent analyses (Table 1 and Supplementary Table 1).
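To make the "HR per standard deviation" scale concrete, the sketch below standardises a set of raw scores to mean zero and unit variance and then converts a standardised score into a hazard relative to a woman at the mean. It assumes the log-linear form of a Cox model with a continuous covariate. The raw scores are invented, and the use of the population standard deviation is an assumption.

```python
import statistics

def standardise(scores):
    """Scale scores to mean zero and unit variance (population SD assumed)."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]

def relative_hazard(z, hr_per_sd):
    """Hazard relative to the mean for a score z SDs away, under a log-linear model."""
    return hr_per_sd ** z

raw_scores = [0.8, 1.1, 1.3, 1.9, 2.4]      # hypothetical raw PRS values
z_scores = standardise(raw_scores)

# With HR per SD = 1.71 (the PRS-CS estimate above), two SDs above the mean
# corresponds to 1.71 squared under this model.
print(round(relative_hazard(2.0, 1.71), 2))
```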

Table 1 Comparison of effect sizes for three polygenic risk scores (PRS) with and without excluding the PALB2 and CHEK2 loci and regions around them.

We then studied the allele frequencies, geographic variation and risk estimates for the two Finnish-enriched, high-impact breast cancer mutations. The allele frequency for rs180177102 (PALB2) was 0.0014 (242-fold enrichment compared to non-Finnish non-Estonian Europeans, NFEE12), with 336 heterozygote mutation carriers included in the analyses. The allele frequency for rs555607708 (CHEK2) was 0.0064 (3.7 times enriched in Finns compared to NFEE), with 1641 heterozygotes and 7 homozygote individuals.

Geographic variation of genetic risk

Considering Finns have passed internal genetic bottlenecks, we first aimed to characterise any geographic distribution for both the PALB2 and CHEK2 mutations, and for the PRS. Both the PALB2 and CHEK2 mutations had more carriers in Eastern Finland, with the proportion of carriers ranging from close to 0 in Western Finland, to 2.8% for CHEK2 in South Karelia and to 0.8% for PALB2 in North Karelia (Fig. 1). In contrast, we observed slightly higher proportions of individuals with high PRS in Western and Southern Finland, in line with breast cancer incidence.

Fig. 1: Geographic variation in genetic risk.

The risk is compared to age-standardised breast cancer incidence. The proportion of women with the breast cancer polygenic risk score (PRS) above the 90th percentile in each region is estimated with respect to the PRS distribution of the whole country. The PALB2 and CHEK2 maps show across different regions the proportion of women carrying at least one risk allele for the variants. The areas represent region of birth obtained from Statistics Finland. The national breast cancer incidence in women was obtained from the Finnish Cancer Registry (publicly available at https://cancerregistry.fi/statistics/) with diagnosis C50 (International Classification of Diseases for Oncology, 3rd edn, ICD-O-3). The incidence represents the mean of 5-year age-standardised incidences (based on the 2014 Finnish population, calculated for each hospital district over 1998–2007). The mean and standard deviation were calculated over the different regions. Variants: rs180177102 (c.1592delT) for PALB2 and rs555607708 (c.1100delC) for CHEK2. CHEK2 and polygenic risk score plots are based on 122,978 women, and PALB2 on 109,371 women. Colour contrasts were chosen approximately based on the standard deviation for each map.

The effect of frameshift mutations in PALB2 and CHEK2

Both the PALB2 and CHEK2 mutations conferred a considerably elevated risk of breast cancer (Table 2). The PALB2 variant was associated with a HR of 4.99 (95% CI 4.02–6.20, p = 6.76 × 10−48), corresponding to a lifetime risk by age 80 of 56.1% (95% CI 50.8–61.4%). The CHEK2 variant was associated with a HR of 2.19 (95% CI 1.91–2.51, p = 3.90 × 10−29), corresponding to a lifetime risk of 31.7% (95% CI 29.5–33.9%). Compared with women with a PRS between the 10th and 90th percentiles (lifetime risk 15.5%, 95% CI 15.3–15.7%), women with a PRS above the 90th percentile had an effect size similar to that of CHEK2 mutation carriers (HR 2.38, 95% CI 2.26–2.50, p = 1.98 × 10−230) and a similar lifetime risk (32.5%, 95% CI 31.6–33.4%). However, a high PRS affected a nearly 7-fold larger group of women (Table 1; results excluding first-degree relatives in Supplementary Table 2). Estimating these risks while accounting for competing risks (non-breast cancer related death) yielded 4.6%, 4.9% and 3.2% lower estimates for lifetime risk in carriers of the PALB2 mutation, carriers of the CHEK2 mutation and women with a high PRS, respectively (Supplementary Figs. 1–3 and Supplementary Table 3).

Table 2 Risk for breast cancer events in the population in carriers of the PALB2 and CHEK2 frameshift mutations, and in the top decile of the polygenic risk score (PRS).

PRS modifies the risk in PALB2 and CHEK2 mutation carriers

Next, we estimated how the PRS modifies breast cancer risk in the mutation carriers. For both PALB2 and CHEK2, a high PRS further increased the breast cancer risk. In terms of lifetime risk for breast cancer by age 80, women with the PALB2 mutation and average PRS (10–90th percentile) had a lifetime risk of 55.3% (95% CI 49.4–61.2%), which increased to 83.9% (71.2–96.6%) among women with a high PRS (>90th percentile), and decreased to 49.1% (30.6–67.6%) in women with a low PRS (<10th percentile; Fig. 2 and Tables 3 and 4). Women with CHEK2 and an average PRS had a lifetime risk of 29.3% (95% CI 26.8–31.8%) which doubled to 59.2% (52.1–66.3%) in women with a high PRS and decreased to 9.3% (4.5–14.1%) in women with low PRS.

Fig. 2: The impact of polygenic risk in PALB2 and CHEK2 mutation carriers.

Adjusted survival curves showing how the polygenic risk score (PRS) affects the breast cancer risk conferred by the PALB2 (panel A) and CHEK2 (panel B) frameshift mutations. Population level was defined as women with PRS between the 10th and 90th percentiles. The PALB2 analysis was done in 109,371 women and the CHEK2 analysis in 122,978 women. Adjusted survival curves are based on a Cox proportional hazards model.

Table 3 Impact of polygenic risk score (PRS) on the breast cancer risk conferred by the PALB2 frameshift mutation.
Table 4 Impact of polygenic risk score (PRS) on the breast cancer risk conferred by the CHEK2 frameshift mutation.

To test for possible interaction between the mutations and the PRS, we first compared the PRS effect size in pooled mutation carriers (PALB2 and CHEK2) and in non-carriers. In both carriers and non-carriers, hazard ratios for the top and bottom decile of the PRS were very similar (reference group PRS 10–90%; Table 5). This was also observed in PALB2 and CHEK2 mutation carriers separately. The HR per SD was 1.81 (95% CI 1.34–2.44, p = 1.05 × 10−4) in carriers of the PALB2 mutation, 1.86 (1.60–2.16, p = 6.58 × 10−16) in carriers of the CHEK2 mutation, and 1.71 (1.67–1.74, p < 1.00 × 10−300) in carriers of neither mutation. Similarly, in a formal test for interaction by introducing an interaction term in the regression model, we found no evidence of an interaction between the PRS and either the PALB2 variant (p = 0.18) or the CHEK2 variant (p = 0.45).

Table 5 To test for interaction in all 122,978 women, we compared the polygenic risk score (PRS) effect size in pooled mutation carriers (pooling PALB2 and CHEK2) and in non-carriers.

PRS refines risk assessment in first-degree relatives

Next, we evaluated how the PRS modifies the risk conferred by a positive first-degree family history. Family history was assessed in 7715 mother–daughter pairs and 12,086 pairs of sisters, separately for family history of early-onset (age < 45) and late-onset (age ≥ 45) breast cancer. For both, the PRS stratified women for breast cancer risk, but the stratification was more pronounced for family history of early-onset disease (Fig. 3 and Supplementary Table 4). Women with an average PRS (between the 10th and 90th percentiles) and a positive family history of early-onset breast cancer had a lifetime risk of 32.5% (95% CI 24.0–41.0%), a risk similar to that of women with a high PRS (>90th percentile) in the full dataset (32.5%, 31.6–33.4%). A combination of family history of early-onset breast cancer and a high PRS further increased the risk to 49.0% (30.1–67.9%), but with only one breast cancer case in the bottom decile we were unable to estimate the impact of a low PRS. We then tested whether family history adds to risk assessment when the woman’s PRS is known. When adjusting for a continuous PRS, the effect size for family history of early-onset breast cancer was attenuated, from HR 2.80 (95% CI 1.81–4.33, p = 4.08 × 10−6) to HR 2.32 (1.50–3.60, p = 1.72 × 10−4). The association for late-onset family history was also attenuated, from HR 1.30 (1.07–1.57, p = 0.01) to HR 1.09 (0.90–1.33, p = 0.37).

Fig. 3: Impact of the polygenic risk score (PRS) in estimating the breast cancer risk of women with a first-degree relative diagnosed with breast cancer.

a Shows the impact of family history of early-onset breast cancer, and b the impact of family history of late-onset breast cancer. Adjusted survival curves based on Cox proportional hazards models. Risk estimated in 7715 mother–daughter pairs and 12,086 full sibling-pairs (sisters). The pairs of first-degree relatives were inferred with KING by a kinship coefficient ranging between 0.177 and 0.354 (inference based on 57 K unlinked variants). Due to the sample size, we were unable to assess impact of a low PRS (<10th percentile) with early-onset family history.

High PRS increases risk for contralateral breast cancer

Lastly, we tested the association between the PRS and contralateral breast cancer among breast cancer patients. With PRS between the 10th and 90th percentile as reference, a high PRS (>90th percentile) was associated with risk of contralateral breast cancer with HR 1.60 (95% CI 1.25–2.04, p = 0.0002), with 97 individuals out of 1604 cases with a high PRS being diagnosed with contralateral breast cancer.

Discussion

Using large-scale biobank data combining longitudinal nationwide health registry data with genomic information, we show that over the life course, the breast cancer PRS strongly alters the breast cancer incidence in high-impact mutation carriers. After breast cancer diagnosis, individuals with an elevated PRS have an increased likelihood of developing contralateral breast cancer, and the PRS can considerably improve risk assessment among their female first-degree relatives.

The breast cancer PRS strongly altered the risk of breast cancer in PALB2 and CHEK2 mutation carriers, substantially increasing the risk of breast cancer in women with a high PRS, and lowering the risk in women with a low PRS. Deciding on appropriate surveillance and risk-reduction strategies is a clinical challenge particularly for moderate-risk mutations such as those in CHEK213, and our results show that additional information provided by the PRS could guide these decisions. A combination of a breast cancer PRS in the top decile and the CHEK2 mutation increased the lifetime risk to 59%, a risk comparable to that seen in PALB2 mutation carriers, whereas CHEK2 mutation carriers with a PRS in the bottom decile had a risk similar to the population level.

That the PRS modifies the risk in PALB2 and CHEK2 mutation carriers supports previous findings suggesting that common genetic variation at least partly explains the widely observed incomplete penetrance of mutations in breast cancer susceptibility genes14,15,16,17. This variation is now measurable on an individual level with the breast cancer PRS, which captures a wide range of molecular pathways. Our results are in line with previous studies on BRCA1, BRCA2, PALB2 and CHEK2 mutation carriers, but these studies have used a case–control setting or PRSs consisting of <100 variants16,17,18,19. We conducted the study in a large longitudinal dataset with 120,000 women, using a more predictive, genome-wide PRS and leveraging the considerable enrichment of the PALB2 and CHEK2 variants in an isolated population. With the longitudinal setting, we were also able to quantify the lifetime risk in PALB2 and CHEK2 mutation carriers based on observed events over the life course, instead of calculating them using baseline risks from published studies17,18.

Harbouring pathogenic mutations in high-risk breast cancer susceptibility genes often prompts intensified medical surveillance and consideration of preventative procedures such as risk-reducing surgery. The lifetime risk estimate for individuals in the top decile of the PRS was comparable to that of CHEK2 mutation carriers: both had a risk of 32% by age 80. Considering this, our results also argue for studies on the impact of targeted actions in women with a high PRS only, who currently go undetected. After diagnosis, patients with an elevated PRS had a 1.6-fold elevated risk of contralateral breast cancer, providing additional evidence of increased breast cancer susceptibility, a finding that might warrant intensified or prolonged surveillance in breast cancer cases with an elevated PRS. This finding is in line with earlier studies showing that familial factors contribute to the risk of contralateral breast cancer20,21,22.

The proportion of mutation carriers and the elevated PRS showed differing geographical distributions. While the elevated PRS distribution followed the breast cancer incidence distribution with highest rates in the early-settlement region in South-Western Finland, the allele frequencies for the PALB2 and CHEK2 mutations were highest in the late-settlement region in Eastern Finland. It is likely that the PALB2 and CHEK2 mutations have survived both the founder bottleneck in Finland and the internal bottleneck in Eastern Finland, and are therefore heavily enriched in the Finnish population. These regional differences in both PRS and mutation frequency distributions may have an impact on regional screening strategies.

Finally, the PRS improved risk assessment of first-degree relatives of women with breast cancer, with pronounced stratification particularly for family history of early-onset disease. Family history is an essential factor guiding screening strategies of family members of breast cancer patients23, and our results show that PRS could improve the precision of this assessment.

Our study has several limitations. Our findings are limited to individuals of European ancestry and it is important to study the applicability of the results in individuals of admixed and non-European ancestry24. The FinnGen study is a mixture of population-based cohorts and samples from hospital biobanks. It is possible that the sampling may introduce biases in some of the estimates. We observed a slightly higher baseline risk compared to the NORDCAN database25. However, our key PRS estimates were similar when estimated in a FinnGen subset of population-based cohorts only. Moreover, accounting for the competing risk of mortality from other causes yielded slightly lower estimates for lifetime risks.

In conclusion, we show that a high breast cancer PRS comes with a comparable risk profile to frameshift mutations in breast cancer susceptibility genes PALB2 and CHEK2, and that the PRS strongly modifies breast cancer risk in the mutation carriers. Even after the breast cancer diagnosis, the PRS was associated with breast cancer susceptibility by increasing the risk of contralateral breast cancer, and it considerably improved risk assessment among the patient’s first-degree relatives. These results demonstrate opportunities for a more comprehensive way of assessing genetic risk in the general population, in breast cancer patients and in unaffected family members of breast cancer patients. Optimisation of these strategies in the clinical setting warrants further study.

Methods

Participants and endpoints

The data comprised 122,978 Finnish women in FinnGen Data Freeze 5. FinnGen comprises prospective epidemiological cohorts (initiated as far back as 1992), disease-based cohorts, and hospital biobank samples (Supplementary Table 5). The unique national personal identification number links the genotypes to the Finnish Cancer Registry (available from 1953, with nationwide completeness of solid tumours at 96%26), as well as to the national hospital discharge registry (1968-), the national death registry (1969-) and the medication reimbursement registry (1964-). These registries cover the whole population.

Breast cancer cases were identified through the Finnish Cancer Registry with diagnosis C50 (International Classification of Diseases for Oncology, 3rd Edition; ICD-O-3), from the drug reimbursement registry by selecting individuals with a reimbursement code for breast cancer, and from the death registry with ICD-10 C50. Contralateral breast cancer was defined as breast cancer in the opposite breast diagnosed over 6 months after the date of the primary breast cancer diagnosis, obtained from the Cancer Registry.
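The contralateral definition above can be sketched as a small date rule. This is an illustration only: the side labels, the function names and the choice to count "6 months" in calendar months (with the day clamped to the end of the month) are assumptions, not details given in the text.

```python
from datetime import date

def add_months(d, months):
    """Shift a date by whole calendar months, clamping the day to the month end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    first_of_next = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    last_day = (first_of_next - date(year, month, 1)).days
    return date(year, month, min(d.day, last_day))

def is_contralateral(primary_side, primary_date, second_side, second_date):
    """Opposite breast, diagnosed more than 6 months after the primary diagnosis."""
    opposite = {primary_side, second_side} == {"left", "right"}
    return opposite and second_date > add_months(primary_date, 6)

# Hypothetical example: a right-sided tumour 7 months after a left-sided primary.
print(is_contralateral("left", date(2010, 1, 15), "right", date(2010, 8, 15)))
```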

Genotyping and imputation

FinnGen samples were genotyped with Illumina and Affymetrix arrays (Illumina Inc., San Diego, and Thermo Fisher Scientific, Santa Clara, CA, USA), and genotype calls were made with the GenCall or zCall algorithm for Illumina data and the AxiomGT1 algorithm for Affymetrix data. Individuals with ambiguous gender, high genotype missingness (>5%), excess heterozygosity (±4 SD) or non-Finnish ancestry were excluded, as were all variants with high missingness (>2%), a low Hardy–Weinberg equilibrium p-value (<1e-6) or a low minor allele count (MAC < 3). Array data pre-phasing was carried out with Eagle 2.3.527 with the number of conditioning haplotypes set to 20,000. Genotype imputation was done with Beagle 4.128 (as described in https://doi.org/10.17504/protocols.io.xbgfijw) by using the SISu v3 population-specific reference panel developed from high-quality data for 3,775 high-coverage (25–30x) whole-genome sequences in Finns.
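The exclusion thresholds listed above can be written as two simple predicates. This is a sketch, not the FinnGen pipeline: the argument names are hypothetical, heterozygosity is assumed to be supplied as a z-score, and values exactly at a threshold are assumed to pass.

```python
def sample_passes_qc(missingness, het_z, sex_ambiguous, finnish_ancestry):
    """Keep a sample unless it meets any of the exclusion criteria."""
    return (
        not sex_ambiguous
        and missingness <= 0.05        # excluded if genotype missingness > 5%
        and abs(het_z) <= 4            # excluded if heterozygosity beyond +/- 4 SD
        and finnish_ancestry
    )

def variant_passes_qc(missingness, hwe_p, mac):
    """Keep a variant unless it meets any of the exclusion criteria."""
    return (
        missingness <= 0.02            # excluded if missingness > 2%
        and hwe_p >= 1e-6              # excluded if HWE p-value < 1e-6
        and mac >= 3                   # excluded if minor allele count < 3
    )

# Hypothetical values for one sample and one variant.
print(sample_passes_qc(0.01, 1.2, False, True), variant_passes_qc(0.03, 0.4, 50))
```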

Variants

We chose two previously reported Finnish-enriched frameshift variants for our main analyses, rs180177102 (c.1592delT) in PALB2 and rs555607708 (c.1100delC) in CHEK2. Genotype data batches with an imputation INFO score <0.8 were excluded. This excluded 13,607 women from analyses involving the PALB2 variant (mainly older disease-based cohorts), but no exclusions were needed for CHEK2. PALB2 mutation carrier status was ignored in analyses involving the CHEK2 variant, and vice versa. Women homozygous for the CHEK2 variant were analysed jointly with the heterozygotes.

Polygenic risk score

To choose our breast cancer PRS, we compared three scores: (1) a previously published PRS with 313 SNPs7, (2) another previously published, genome-wide PRS10 built with the software LDpred29 and (3) a genome-wide PRS we built with the software PRS-CS (PRS-CS-auto, with 1000 Genomes Project European sample, N = 503, as the external LD reference panel) using HapMap3 variants11. For the LDpred and PRS-CS PRSs, the input weights came from a large independent genome-wide association study (GWAS)6. To have a PRS independent of the PALB2 and CHEK2 variants, we excluded the variants within the CHEK2 gene ±3 Mb, and variants within the PALB2 gene ±2 Mb (Supplementary Fig. 4). Out of these three, the PRS-CS score showed the strongest association for breast cancer and was therefore chosen for subsequent analyses (Table 1 and Supplementary Table 1). All three PRSs showed acceptable goodness-of-fit (Supplementary Fig. 5). The final variant count for the PRS-CS PRS with PALB2 and CHEK2 excluded was 1,074,667.
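The exclusion of variants around the two genes can be pictured as a window filter on genomic position. In the sketch below the chromosomes (22 for CHEK2, 16 for PALB2) and the flank sizes follow the text, but the gene start and end positions are placeholders and must be replaced with coordinates from a gene annotation on the same genome build as the variant data.

```python
MB = 1_000_000

# (chromosome, gene_start, gene_end, flank). Start and end are PLACEHOLDERS,
# not real gene coordinates.
EXCLUDED_REGIONS = [
    ("22", 28_700_000, 28_750_000, 3 * MB),   # CHEK2 gene +/- 3 Mb
    ("16", 23_600_000, 23_650_000, 2 * MB),   # PALB2 gene +/- 2 Mb
]

def in_excluded_region(chrom, pos, regions=EXCLUDED_REGIONS):
    """True if the variant lies within a gene or its flanking window."""
    for r_chrom, start, end, flank in regions:
        if chrom == r_chrom and start - flank <= pos <= end + flank:
            return True
    return False

# Hypothetical variants given as (chromosome, position).
variants = [("22", 26_000_000), ("22", 25_000_000), ("16", 25_000_000)]
kept = [v for v in variants if not in_excluded_region(*v)]
print(kept)
```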

A high PRS was defined as a PRS above the 90th percentile, as it corresponds to a lifetime risk of ≥30%, which guidelines consider as the threshold for high risk23. Correspondingly, we defined a PRS below the 10th percentile as a low PRS. Individuals between the 10th and 90th percentiles served as the reference category.
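The three PRS categories can be derived from the score distribution as follows. This is a sketch with made-up scores. The interpolation method used for the percentiles and the assignment of values exactly at a cut-off to the reference category are assumptions.

```python
import statistics

def prs_cutoffs(scores):
    """Return the 10th and 90th percentile cut points of the score distribution."""
    deciles = statistics.quantiles(scores, n=10, method="inclusive")
    return deciles[0], deciles[-1]

def prs_category(score, low_cut, high_cut):
    """Classify a score as low (<10th), high (>90th) or average (reference)."""
    if score < low_cut:
        return "low"
    if score > high_cut:
        return "high"
    return "average"

scores = list(range(101))                 # hypothetical scores 0..100
low_cut, high_cut = prs_cutoffs(scores)
print([prs_category(s, low_cut, high_cut) for s in (5, 50, 95)])
```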

Geographic variation

Geographic variation is reported by region of birth (obtained from Statistics Finland) as the proportion of individuals with (1) the frameshift mutations in the PALB2 or CHEK2 variants, and (2) high PRS (>90th percentile). The benchmark for these analyses was age-standardised (age in 2014) breast cancer incidence for the whole Finnish population, calculated as the mean of 5-year incidences for each hospital district over 1998–2007. The incidence data was obtained from the Finnish Cancer Registry (publicly available at https://cancerregistry.fi/statistics/). Polygon data for the Finnish map were obtained from GADM (https://gadm.org/data.html).

Population structure-related bias analysis

A population structure-related bias analysis was performed by following the approach described in detail in Kerminen et al.30. In brief, the method measures the accumulation of PRS differences between the Western and Eastern subpopulations of Finland using a “random PRS”, made from a randomly chosen set of independent (r2 < 0.1) variants with minor allele frequency >0.05 that are not associated with breast cancer (breast cancer GWAS6 p-value >0.5). If such random PRS accumulated differences between the subpopulations, that could indicate a population genetic bias in effect estimates of the GWAS, rather than a real difference in genetic susceptibility of breast cancer between the subpopulations. We found no evidence of such bias (Supplementary Fig. 6), which indicates that any detected geographic variation in the PRS is unlikely to result from a population genetic bias.
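The variant selection behind such a "random PRS" can be sketched as a filter followed by a random draw. This is only an illustration of the criteria quoted above, not the implementation of Kerminen et al. The dictionary keys are hypothetical, and LD pruning (r2 < 0.1) is assumed to have been done upstream and recorded in an `ld_independent` flag.

```python
import random

def eligible_for_random_prs(variant):
    """Common, LD-independent variant with no evidence of association."""
    return (
        variant["ld_independent"]       # assumed result of pruning at r2 < 0.1
        and variant["maf"] > 0.05
        and variant["gwas_p"] > 0.5
    )

def draw_random_prs_variants(variants, n, seed=0):
    """Randomly choose up to n eligible variant IDs (seeded for reproducibility)."""
    pool = [v["id"] for v in variants if eligible_for_random_prs(v)]
    rng = random.Random(seed)
    return rng.sample(pool, min(n, len(pool)))

# Hypothetical variant records.
variants = [
    {"id": "v1", "maf": 0.20, "gwas_p": 0.80, "ld_independent": True},
    {"id": "v2", "maf": 0.01, "gwas_p": 0.90, "ld_independent": True},
    {"id": "v3", "maf": 0.30, "gwas_p": 0.01, "ld_independent": True},
    {"id": "v4", "maf": 0.10, "gwas_p": 0.60, "ld_independent": True},
]
chosen = draw_random_prs_variants(variants, n=2)
print(sorted(chosen))
```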

Risk assessment in first-degree relatives

The pairs of first-degree relatives were inferred with KING v2.2.431 by a kinship coefficient ranging between 0.177 and 0.354 (inference based on 57 K unlinked variants). To analyse the impact of family history in first-degree relatives, we randomly sampled one female relative for each woman who had at least one first-degree relative in the dataset. For mother–daughter pairs, the mother was assigned as the index relative. For sisters, we randomly assigned one to be the index relative, irrespective of age. If both women in the pair were breast cancer cases, we used the year of diagnosis to assign the woman diagnosed earlier as the index. Some individuals appeared several times as non-index individuals, which may occur when, for instance, a woman is the daughter of one index individual and the sister of another – we therefore randomly sampled the data to contain each non-index individual only once. We then inferred the risk of breast cancer in these unique non-index individuals. We analysed separately family history of early-onset (age < 45) and late-onset (age ≥ 45) breast cancer.
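The pair selection rules above can be sketched in a few lines. The kinship range is taken from the text; treating its endpoints as inclusive, the record layout and the random handling of sisters diagnosed in the same year are assumptions made for illustration.

```python
import random

def is_first_degree(kinship):
    """First-degree pair if the kinship coefficient lies in the stated range."""
    return 0.177 <= kinship <= 0.354

def choose_index_sister(sister_a, sister_b, rng):
    """Pick the index relative in a pair of sisters.

    If both are cases with different diagnosis years, the earlier-diagnosed
    sister is the index; otherwise one sister is chosen at random.
    """
    year_a, year_b = sister_a.get("dx_year"), sister_b.get("dx_year")
    if year_a is not None and year_b is not None and year_a != year_b:
        return sister_a if year_a < year_b else sister_b
    return rng.choice([sister_a, sister_b])

rng = random.Random(1)
# Hypothetical sisters, both breast cancer cases.
a = {"id": "A", "dx_year": 2005}
b = {"id": "B", "dx_year": 1999}
print(is_first_degree(0.25), choose_index_sister(a, b, rng)["id"])
```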

Statistical analysis

We estimated HRs and 95% CIs with the Cox proportional hazards model, and used Schoenfeld residuals and log–log inspection for assessing the proportional hazards assumption. Start of follow-up was set at birth, and follow-up ended at the first record of the endpoint of interest, death or at the end of follow-up on 31 December 2018, whichever came first. All tests were two-tailed. In all survival analyses, we used age as the time scale, with 63 batches and the first 10 principal components as covariates. The only exception was the analysis on contralateral breast cancer, where follow-up started from the diagnosis, and age was included as a covariate.
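The follow-up definition above, with age as the time scale, can be sketched as follows. The function and argument names are hypothetical, age is approximated as days divided by 365.25, and an event recorded on the same day as death is assumed to count as an event.

```python
from datetime import date

END_OF_FOLLOW_UP = date(2018, 12, 31)

def follow_up(birth, event_date=None, death_date=None):
    """Return (age at exit in years, event indicator) for one woman.

    Follow-up starts at birth and ends at the first of: the endpoint,
    death, or the administrative end of follow-up.
    """
    candidates = []
    if event_date is not None:
        candidates.append((event_date, True))
    if death_date is not None:
        candidates.append((death_date, False))
    candidates.append((END_OF_FOLLOW_UP, False))
    # min() keeps the first of tied dates, so an event is listed first.
    exit_date, is_event = min(candidates, key=lambda c: c[0])
    age_at_exit = (exit_date - birth).days / 365.25
    return age_at_exit, is_event

# Hypothetical woman born in 1950 and diagnosed in 2010.
print(follow_up(date(1950, 1, 1), event_date=date(2010, 1, 1)))
```

Diagnoses dated after the end of follow-up are treated as censored, because the administrative end date is reached first.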

Goodness-of-fit for the PRS was assessed with a method proposed by May & Hosmer for a Cox proportional hazards model32. In line with previous studies on breast cancer susceptibility genes, we assessed lifetime risk (cumulative incidence without competing risks) by age 8014,33. Lifetime risk was estimated from the adjusted survival curves, with 95% CIs obtained by normal approximation. The adjusted survival curves were plotted with the R package survminer, which presents the expected survival curves separately for subgroups, based on the Cox model. To estimate the covariate-adjusted cumulative incidence functions in the presence of competing risks, we used the Stata module stcompadj34. The competing event was non-breast cancer causes of death, and covariates were assumed to have similar effects on the main and competing events.
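The step from an adjusted survival curve to a lifetime risk with a normal-approximation interval can be sketched as below. The survival value and standard error are made-up inputs chosen only for illustration, and the standard error is assumed to be on the risk scale.

```python
def lifetime_risk_ci(survival_at_80, se, z=1.96):
    """Lifetime risk as 1 - S(80), with a symmetric normal-approximation CI."""
    risk = 1.0 - survival_at_80
    return risk, risk - z * se, risk + z * se

# Hypothetical inputs: adjusted survival at age 80 and its standard error.
risk, lower, upper = lifetime_risk_ci(survival_at_80=0.439, se=0.027)
print(f"{risk:.1%} ({lower:.1%} to {upper:.1%})")
```

A symmetric interval of this kind can extend below 0% or above 100% when the estimate is near a boundary or the standard error is large.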

Interactions between the PRS and the pathogenic mutations were assessed (1) by comparing the PRS effect sizes in pooled and non-pooled mutation carriers and non-carriers (with the PRS scaled to zero mean and unit variance within the whole dataset), and (2) formally by introducing an interaction term for the mutation and the continuous PRS. For data and variant handling and PRS calculation, we used BCFtools versions 1.7 and 1.9, and PLINK 2.0. For statistical analyses, we used R 3.6.3 and Stata 16.0 (College Station, TX, USA). Cromwell and WOMtool were used for workflow handling.
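For readers who want a rough numerical feel for the first comparison, the sketch below contrasts two published hazard ratios using only their confidence intervals. It is not the interaction-term test used in the study and is not expected to reproduce its p-values: it is a simple two-sample z-test on the log hazard ratios, which assumes that the two groups are independent and that the intervals are Wald-type 95% intervals.

```python
import math

def log_hr_and_se(hr, lower, upper, z=1.96):
    """Log hazard ratio and its standard error recovered from a 95% CI."""
    return math.log(hr), (math.log(upper) - math.log(lower)) / (2 * z)

def compare_hazard_ratios(a, b):
    """Two-sided z-test for a difference between two log hazard ratios."""
    log_a, se_a = log_hr_and_se(*a)
    log_b, se_b = log_hr_and_se(*b)
    z = (log_a - log_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided normal p-value
    return z, p

# HR per SD of the PRS (estimate, lower, upper) as reported in the Results.
palb2_carriers = (1.81, 1.34, 2.44)
non_carriers = (1.71, 1.67, 1.74)
z, p = compare_hazard_ratios(palb2_carriers, non_carriers)
print(round(z, 2), round(p, 2))
```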

Ethics statement

The FinnGen project is approved by the Finnish Institute for Health and Welfare (THL; approval number THL/2031/6.02.00/2017, amendments THL/1101/5.05.00/2017, THL/341/6.02.00/2018, THL/2222/6.02.00/2018, THL/283/6.02.00/2019), the Digital and population data service agency (VRK43431/2017-3, VRK/6909/2018-3), the Social Insurance Institution (KELA; KELA 58/522/2017, KELA 131/522/2018, KELA 70/522/2019) and Statistics Finland (TK-53-1041-17).

Patients and control subjects in FinnGen provided informed consent for biobank research, based on the Finnish Biobank Act. Alternatively, older research cohorts, collected prior to the start of FinnGen (in August 2017), were collected based on study-specific consents and later transferred to the Finnish biobanks after approval by Valvira, the National Supervisory Authority for Welfare and Health. Recruitment protocols followed the biobank protocols approved by Valvira. The Ethics Review Board of the Hospital District of Helsinki and Uusimaa approved the FinnGen study protocol Nr HUS/990/2017.

The Biobank Access Decisions for FinnGen samples and data utilised in FinnGen Data Freeze 5 include: THL Biobank BB2017_55, BB2017_111, BB2018_19, BB_2018_34, BB_2018_67, BB2018_71, BB2019_7, Finnish Red Cross Blood Service Biobank 7.12.2017, Helsinki Biobank HUS/359/2017, Auria Biobank AB17-5154, Biobank Borealis of Northern Finland_2017_1013, Biobank of Eastern Finland 1186/2018, Finnish Clinical Biobank Tampere MH0004, Central Finland Biobank 1-2017 and Terveystalo Biobank STB 2018001. Analyses of potential geographic bias of PRS were done with THL biobank permission BB2019_44.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.