Background Identifying individuals with genetic susceptibility to disease through population screening is considered a promising approach for disease prevention. DNA mismatch repair (MMR) genes, including MLH1, MSH2, MSH6 and PMS2, play essential roles in maintaining microsatellite stability through DNA mismatch repair. Pathogenic variation in MMR genes causes microsatellite instability and is a genetic predisposition for cancer, as represented by Lynch syndrome. While the prevalence and spectrum of MMR variation have been extensively studied in cancer, they remain largely unknown in the general population. This lack of knowledge hinders effective prevention of cancer caused by MMR variation. In the current study, we addressed the issue by using the Chinese population as a model.
Methods We performed extensive data mining to collect MMR variant data from 18 844 ethnic Chinese individuals, and comprehensively analysed the collected variants to determine the prevalence, spectrum and features of MMR variation in the Chinese population.
Results We identified 17 687 distinct MMR variants. We observed substantial differences in MMR variation between the general Chinese population and Chinese patients with cancer, identified highly Chinese-specific MMR variation by comparing MMR data between Chinese and non-Chinese populations, predicted an enrichment of deleterious variants among the unclassified Chinese-specific MMR variants, determined a prevalence of 0.18% for MMR pathogenic variants in the general Chinese population and determined that MMR variation in the general Chinese population is evolutionarily neutral.
Conclusion Our study provides a comprehensive view of MMR variation in the general Chinese population, a resource for biological study of human MMR variation, and a reference for MMR-related cancer applications.
- genetic variation
- DNA repair
- genomic instability
- human genetics
Data availability statement
All data relevant to the study are included in the article or uploaded as online supplemental information.
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
Cancer prevention is one of the ultimate goals in medicine.1 Population screening for cancer-causing genetic predisposition has been proposed as a promising approach to reach this goal, as it allows comprehensive identification of predisposition carriers for early prevention before cancer develops.2 3 The rapid progress of DNA sequencing technology is bringing population screening for cancer prevention closer to reality.4–7 However, the scientific basis of population screening needs to be well established.8–10 One of the issues raised by the coming paradigm shift is the lack of basic knowledge of cancer predisposition in the general population.
DNA mismatch repair (MMR) genes, including MLH1, MSH2, MSH6 and PMS2, play essential roles in maintaining microsatellite stability through repairing DNA mismatch errors.11 Pathogenic variants in MMR genes damage their normal function, resulting in microsatellite instability and increased risk of developing multiple types of cancer, as best represented by Lynch syndrome (LS), the cancer caused by MMR disorder affecting the gastrointestinal system.12–14 Since the relationship between MMR variation and cancer was revealed, MMR variation has been extensively characterised in patients with cancer. The resulting MMR variation data are widely used to guide clinical diagnosis and treatment. However, most of the MMR variation data currently available were derived from cancer samples, so the information mainly reflects the status of MMR variation in patients with cancer. Knowledge of MMR variation in the general population without cancer is essential in order to prevent cancer development in the population. Applying the variation data from patients with cancer to the general population can be unreliable, as features of MMR variation such as the spectrum, frequency and penetrance could be substantially different between patients with cancer and the general population.15 16 Furthermore, ethnic-specific MMR variation adds to the complexity, as the current MMR variation data were predominantly derived from Caucasian populations.17
In the present study, we addressed MMR variation in the general population by using the ethnic Chinese population as a model, for which MMR variation has not been systematically characterised so far. We mined MMR variant data from 18 844 ethnic Chinese individuals, determined the prevalence and spectrum, and characterised the features of the variants including ethnic specificity, deleteriousness and evolutionary selection. Data generated from our study provide a scientific foundation for the use of population screening to prevent MMR-related cancer, a resource to study human MMR variation and a reference for MMR-related clinical applications.
Materials and methods
Samples and variant data collection
Genomic sequence data from 18 844 ethnic Chinese individuals were used in the study. These included the whole genome sequences from 10 588 Chinese individuals by the ChinaMAP project (https://creativecommons.org/licenses/by/4.0/),7 the whole genome sequences from 2657 Singapore ethnic Chinese by the Singapore SG10K project,18 the whole genome sequences from 597 Chinese by the Chinese Academy of Sciences Precision Medicine Initiative project,19 the whole genome sequences from 90 Chinese by the Han Chinese study,20 the whole exome sequences from 610 normal Chinese controls by the Chinese breast cancer study21 and the MMR-targeted sequences from 4302 Macau Chinese by our own study (approved by the University of Macau Institutional Review Board, BESRE17‐APP014‐FHS) (table 1A).
The quality of sequence data was checked by FastQC and duplicates were removed by Trimmomatic. Sequencing reads were mapped to the human genome reference sequence (hg19) by Burrows-Wheeler Aligner.22 Sequencing bias was marked by Base Quality Score Recalibration and reads were sorted by Picard. Variants were called using GATK 4.1 following the GATK best practices protocol,23 and annotated and classified by ANNOVAR.24 The Mutalyzer name checker and position converter tools were used for cDNA position and coding change with the following reference sequences: Genome: NC_000003.11 (MLH1), NC_000002.11 (MSH2 and MSH6), NC_000007.13 (PMS2); cDNA: NM_000249.3 (MLH1), NM_000251.2 (MSH2), NM_000179.2 (MSH6), NM_000535.5 (PMS2); protein: NP_000240.1 (MLH1), NP_000242.1 (MSH2), NP_000170.1 (MSH6), NP_000526.2 (PMS2). The following databases were used for comparative analyses: gnomAD (http://gnomad.broadinstitute.org/),25 ExAC (http://exac.broadinstitute.org/)26 and the database of the 1000 Genomes Project (https://www.internationalgenome.org)27 for comparison with the non-Chinese general population; ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/),28 InSiGHT (https://www.insight-group.org/variants/databases/),29 COGR (http://opengenetics.ca/)30 and UMD (http://www.umd.be)31 for comparison with the non-Chinese cancer cohort; and dbSNP150 (https://www.ncbi.nlm.nih.gov/snp/)32 together with the databases listed above for novelty determination. Variants reported in any of the databases mentioned above were classified as known variants; variants not reported in any of them were classified as novel variants. Power calculation was performed following published procedures.33
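The known/novel rule described above amounts to a set-membership test against the union of the reference databases. The following is a minimal sketch of that rule only, not the pipeline used in the study; the function name, the variant key format and the toy database contents are all hypothetical placeholders.

```python
from typing import Dict, Iterable, Set


def classify_novelty(variants: Iterable[str],
                     databases: Dict[str, Set[str]]) -> Dict[str, str]:
    """Label a variant 'known' if any database reports it, otherwise 'novel'."""
    # Union of every variant key reported in at least one database.
    reported = set().union(*databases.values())
    return {v: ("known" if v in reported else "novel") for v in variants}


# Hypothetical toy data: these keys are placeholders, not study variants.
toy_databases = {
    "dbSNP": {"chr3:1000:A>G", "chr2:2000:C>T"},
    "ClinVar": {"chr2:2000:C>T", "chr7:3000:G>A"},
}
toy_calls = ["chr3:1000:A>G", "chr7:3000:G>A", "chr2:4000:T>C"]

labels = classify_novelty(toy_calls, toy_databases)
for variant, label in labels.items():
    print(variant, label)
```

In practice the variant key would need to be normalised (same reference build, same representation of indels) before comparison, otherwise identical variants could be counted as novel.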
Deleterious impact of variants on protein structural stability
The Ramachandran Plot Molecular Dynamic Simulation (RPMDS) method was used following the procedures described in detail previously.34 Briefly, the N terminus of the MLH1 crystal structure (PDB:4P7A, 2.30 Å, residues 1–340) was used as the template. Altered amino acid residues caused by missense variants were incorporated into the wild-type MLH1 structure to generate mutant structures by using the Chimera software.35 Molecular dynamics simulations comprising five measures, RMSD (root-mean-square-deviation),36 RMSF (root-mean-square-fluctuations),37 Rg (radius of gyration),38 SASA (solvent accessible surface area)39 and NH bond (hydrogen bond),40 were applied to measure the structural changes of equilibrium state, flexibility, shape, hydrogen bond, surface accessibility and structural expansion caused by the variants. The Ramachandran plot was used to qualify and quantify the changes. By comparing against the data from known pathogenic and benign variants, the variants with deleterious impact on protein structure stability were identified.
For the evolutionary selection analysis, variants from the Singapore SG10K project were used. The Ka/Ks ratio was calculated by using the dNdSloc model in the dNdScv R package.41 42 Tajima’s D,43 the normalised Fay and Wu’s H test,44 the DH test and the E test45 were performed by using the Readms package of DH software.
MMR variants identified in the general Chinese population
A total of 18 844 ethnic Chinese individuals were included in this study, comprising 11 885 (61.8%) mainland Chinese, 4302 (23.6%) Macau Chinese and 2657 (14.6%) Singapore Chinese (table 1A, figure 1A, online supplemental table S1). Online supplemental figure S1 outlines the analytical process. From the sequence data, we identified 17 687 distinct variants in the MMR genes MLH1, MSH2, MSH6 and PMS2, of which 1166 (6.6%) were located in coding regions and 16 521 (93.4%) were located in non-coding regions (table 1, online supplemental tables S2, S3). Of the 1166 coding variants, 210 were in MLH1, 248 in MSH2, 493 in MSH6 and 215 in PMS2 (table 1B); 57.4% were singletons, 27.0% had 2–5 carriers, 6.1% had 6–10 carriers, 5.8% had 11–50 carriers, 0.4% had 51–100 carriers and 3.3% had >100 carriers (figure 1B, online supplemental table S2). Of the 13 variation types, missense variants had the highest rate (50.3%); 24.1% of the coding variants remained unclassified (table 1B) and 45.2% were novel (table 1C). The Ts/Tv ratio was 3.53. G>A had the highest frequency of 18.8% (table 1D). Multiple variation hotspots were identified in MMR genes: residues 300 to 500 in MLH1, residues 400 to 600 in MSH2, and several hotspots in different functional domains in MSH6 and PMS2 (figure 2). Of the 16 521 non-coding variants, 3416 were in MLH1, 8629 in MSH2, 2052 in MSH6 and 2424 in PMS2; 90.4% (14 941) were located in introns (table 1E), 69.8% (11 528) were not present in dbSNP, 62.9% (10 392) were singletons (online supplemental table S3) and 52.2% were within repetitive sequences (online supplemental table S4). The Ts/Tv ratio was 3.38, lower than the 3.53 in the coding variants. C>T had the highest frequency of 20.7% (table 1D).
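For readers unfamiliar with the Ts/Tv ratio reported above, it is the number of transitions (purine to purine or pyrimidine to pyrimidine changes) divided by the number of transversions. The sketch below shows the calculation on a small hypothetical list of single-nucleotide substitutions; the function names and the toy input are illustrative and are not taken from the study data.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snvs) -> float:
    """Compute transitions / transversions for (ref, alt) base pairs."""
    ts = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
            raise ValueError(f"not a valid substitution: {ref}>{alt}")
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv


# Hypothetical toy input: three transitions and one transversion.
toy_snvs = [("G", "A"), ("C", "T"), ("A", "G"), ("A", "C")]
toy_ratio = ts_tv_ratio(toy_snvs)
print(toy_ratio)
```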
MMR variation between general population and cancer cohort
We compared the MMR coding variants in the general Chinese population with those in Chinese patients with cancer reported in our previous study (online supplemental table S5).17 We observed low overlapping rates between the two datasets: only 16 variants in MLH1, 18 in MSH2, 12 in MSH6 and 23 in PMS2 were shared (figure 1C, online supplemental table S6). We also searched for the 1166 coding variants in the ClinVar and InSiGHT databases. Of the 225 coding variants matched in the databases, the largest number was associated with colorectal cancer (142 variants), followed by LS (100 variants) (online supplemental table S7), suggesting that the coding variants were more oncogenic in colorectal tissue.17
MMR variation between Chinese and non-Chinese populations
We compared MMR data between the Chinese and non-Chinese populations. We first compared the data of general populations using the gnomAD, ExAC and 1000 Genomes databases after removing the variants derived from ethnic Chinese in these databases. We observed that only 42.1% (491) of the 1166 coding variants and 1.9% (319) of the 16 521 non-coding variants were shared between the general Chinese and non-Chinese populations (table 2A,B). Next, we compared the Chinese population with the non-Chinese cancer data from the ClinVar, InSiGHT, COGR and UMD databases after removing the variants derived from ethnic Chinese in these databases. We observed that only 47.4% (553) of coding variants and 1.1% (181) of non-coding variants in the Chinese population were shared with MMR variants derived from non-Chinese patients with cancer. The shared variants had the highest overlapping rate in ClinVar for both coding (624, 53.5%) and non-coding (176, 1.1%) variants, followed by the InSiGHT database for coding (200, 17.2%) and non-coding (119, 0.7%) variants (table 2A,B). In total, 63% (735) of Chinese coding variants and 4.1% (688) of Chinese non-coding variants were shared with non-Chinese populations. Of the 37% (431) of coding variants present in Chinese only, 34.3% (148) were nonsynonymous variants and 2.6% (11) were loss-of-function variants (4 stopgain, 3 frameshift insertion, 2 frameshift deletion and 2 non-frameshift insertion) (online supplemental table S2). The differences between the Chinese and non-Chinese populations imply that MMR variation is highly ethnic-specific.
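The overlap figures above are shares of the Chinese variant set that also appear in a reference set. As a minimal sketch of that calculation (the function name and the toy variant keys are hypothetical; only the counts 491, 735, 431 and 1166 come from the text), the shared count and shared fraction can be computed as follows.

```python
def shared_fraction(population_variants, reference_variants):
    """Return (shared count, shared fraction of the population set)."""
    population = set(population_variants)
    if not population:
        raise ValueError("population variant set is empty")
    shared = population & set(reference_variants)
    return len(shared), len(shared) / len(population)


# Hypothetical toy sets, only to show the mechanics.
toy_population = {"v1", "v2", "v3", "v4"}
toy_reference = {"v2", "v4", "v9"}
toy_count, toy_fraction = shared_fraction(toy_population, toy_reference)

# Recomputing percentages from counts reported in the text.
coding_total = 1166
pct_general = round(100 * 491 / coding_total, 1)        # shared with general populations
pct_any = round(100 * 735 / coding_total, 1)            # shared with any non-Chinese data
pct_chinese_only = round(100 * 431 / coding_total, 1)   # present in Chinese only
print(toy_count, toy_fraction, pct_general, pct_any, pct_chinese_only)
```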
Pathogenic variants in the Chinese population
Using ClinVar as the reference, 898 (77.0%) of the 1166 coding variants were matched in ClinVar and the remaining 268 (23.0%) were Unclassified. The matched variants fell into different clinical classes: 24 (2.0%) as Pathogenic or Likely pathogenic, 507 (43.5%) as Variants of Unknown Significance (VUS), 282 (24.2%) as Benign or Likely benign and 83 (7.3%) as Conflicting classification (table 3A). There were 33 carriers of the 24 Pathogenic/Likely pathogenic variants among the 18 844 general Chinese individuals included in the study; 26 (78.8%) carriers had variants in MSH6 and PMS2 (table 3B). This resulted in a prevalence of 0.18% for MMR pathogenic variants in the tested general Chinese population, including 0.005% in MLH1, 0.032% in MSH2, 0.053% in MSH6 and 0.085% in PMS2 (p<0.005). Power calculation showed that screening 18 844 individuals at a prevalence of 0.18% provides a 97.8% probability of detecting all pathogenic variants in the studied population. The 0.18% prevalence implies an estimated 2.52 million MMR pathogenic variant carriers in the Chinese population of 1.4 billion. We performed the same analysis for the MMR data in the gnomAD database, and observed a 0.11% prevalence of MMR pathogenic variants in the general non-Chinese population, composed of 0.020% in MLH1, 0.015% in MSH2, 0.024% in MSH6 and 0.052% in PMS2. Therefore, the general Chinese population had a higher prevalence than general non-Chinese populations in MSH2, MSH6 and PMS2 but a lower prevalence in MLH1. In the Chinese population, the pathogenic variants c.3226C>T (MSH6), c.825A>G (PMS2) and c.82G>A (MSH2) had 5, 4 and 3 carriers, respectively, suggesting that they may be founder mutations. We further collected cancer distribution information from the ClinVar database for the identified Pathogenic and Likely pathogenic variants. Similar to the distribution of MMR non-pathogenic variants, the 24 Pathogenic/Likely pathogenic variants had the highest frequency in colorectal cancer (table 3C).
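The prevalence arithmetic reported above can be re-derived as a quick check. The sketch below uses only figures stated in the text (33 carriers, 18 844 individuals, a population of 1.4 billion); the variable names are illustrative. It assumes, as the reported numbers suggest, that the population-level estimate was extrapolated from the rounded 0.18% value rather than from the unrounded ratio.

```python
carriers = 33
cohort_size = 18_844
population = 1_400_000_000

# Observed carrier prevalence in the cohort.
prevalence = carriers / cohort_size
prevalence_pct = round(prevalence * 100, 2)

# Extrapolation from the rounded percentage (assumption, see lead-in).
rounded_rate = prevalence_pct / 100
estimated_carriers = rounded_rate * population
one_in_n = round(1 / rounded_rate)

print(f"prevalence: {prevalence_pct}%")
print(f"estimated carriers: {estimated_carriers / 1e6:.2f} million")
print(f"about 1 carrier in every {one_in_n} individuals")
```

Using the unrounded ratio instead would give a somewhat lower carrier estimate, so the extrapolated figures are best read as approximate.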
Ethnic-specific pathogenic variants
Of the 1166 coding variants, 507 (43.5%) were classified as VUS and 268 (23.0%) remained Unclassified (table 3A). The fact that 66.5% of the variants were of unknown function raises the question of whether these unclassified variants have any biological significance. We tested this possibility by measuring the impact of the Unclassified variants on MLH1 protein structure stability using the structure-based RPMDS method.34 Deleterious and non-deleterious variants were differentiated by referring to the cut-off values from wild-type, known benign and known pathogenic variants (online supplemental table S8). We selected the Unclassified variants for the test under the following conditions: (1) the variant must be within the known MLH1 structure, as the RPMDS method relies on the known protein structure in its analysis. Of the 756 amino acid residues in MLH1, the known structure (PDB:4P7A) covers 340 aa (1–340). (2) The variant must be missense, as RPMDS is designed for missense variant analysis. Under these conditions, we identified 16 Unclassified MLH1 variants for the analysis. Six of these variants were identified as deleterious, significantly disturbing the MLH1 structure (online supplemental table S9). For example, K241T had deleterious effects as it destabilised the two alpha helices formed between residues 234 and 340 in MLH1 (online supplemental figure S2). The results indicate that the Unclassified variants were enriched with Chinese-specific deleterious variants.
Evolutionary analysis of Chinese MMR variation
We performed evolutionary analysis for MMR variation in the Chinese population. The Ka/Ks ratio calculated for each MMR gene was 0.83 in MLH1, 0.92 in MSH2, 1.07 in MSH6 and 1.17 in PMS2 (figure 3A); ratios close to 1 are consistent with neutral evolution. The A/S ratios were all close to 2.5 (2.08 in MLH1, 2.29 in MSH2, 2.80 in MSH6 and 2.73 in PMS2). We validated the results by using Tajima’s D, Fay and Wu’s H, DH and Zeng’s E tests (figure 3B), and concluded that MMR genes in the Chinese population were evolutionarily neutral.
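To illustrate what one of the neutrality tests above measures, the following is a minimal sketch of the standard Tajima's D statistic, which contrasts the mean pairwise diversity with the diversity expected from the number of segregating sites. It is an independent illustration of the textbook formula, not the DH software used in the study; the function names and the toy allele counts are hypothetical.

```python
import math


def pairwise_diversity(n: int, derived_counts) -> float:
    """Mean pairwise differences, summed over sites, for n sampled sequences."""
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in derived_counts)


def tajimas_d(n: int, seg_sites: int, pi: float) -> float:
    """Tajima's D from sample size, segregating sites and pairwise diversity."""
    if n < 4:
        raise ValueError("need at least 4 sequences for a non-zero variance")
    if seg_sites < 1:
        raise ValueError("need at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)


# Hypothetical toy samples of 10 sequences with 5 segregating sites each.
n = 10
singleton_sites = [1, 1, 1, 1, 1]   # excess of rare alleles
balanced_sites = [5, 5, 5, 5, 5]    # excess of intermediate-frequency alleles
d_singletons = tajimas_d(n, 5, pairwise_diversity(n, singleton_sites))
d_balanced = tajimas_d(n, 5, pairwise_diversity(n, balanced_sites))
print(d_singletons, d_balanced)
```

A value near zero is what the neutral model predicts; an excess of rare alleles pushes D negative and an excess of intermediate-frequency alleles pushes it positive, which is why values close to zero support the neutrality conclusion.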
Several conclusions can be made from our current study:
Modest prevalence of MMR variation in the general population. Of the 1166 MMR coding variants identified in the general Chinese population, we identified 24 Pathogenic/Likely pathogenic variants with 33 carriers. Based on the data, we determined a 0.18% prevalence of MMR pathogenic variation in the 18 844 ethnic Chinese individuals, which corresponds to an estimated 2.52 million MMR pathogenic variant carriers in the Chinese population of 1.4 billion. It is important to note that a group of autosomal dominant cancer predisposition mutations has a much higher prevalence in the general population than those in many non-cancer hereditary diseases, such as spinocerebellar ataxia (SCA), in which there are only a few carriers per 100 000 individuals.46 For example, the prevalence of pathogenic mutation in BRCA1/BRCA2 (BRCA) is 0.26% in Japanese (1 in 384),47 0.38% in Chinese (1 in 265),48 0.38% in Mexicans (1 in 265),49 0.39% in Malaysians (1 in 256),50 0.53% in the US population (1 in 189)51 and 2.17% in the Ashkenazi Jewish (1 in 46).52 The 0.18% prevalence in the Chinese population is the combined prevalence for the four MMR genes MLH1, MSH2, MSH6 and PMS2. With 1 in every 556 Chinese individuals being an MMR pathogenic variant carrier, MMR-related cancer risk poses a serious threat to public health, which underlines the importance of preventing MMR-related cancer in the general Chinese population. However, the 0.18% prevalence is lower than the 0.38% prevalence for BRCA pathogenic variation in the general Chinese population.48 Although MMR and BRCA are the two groups of cancer predisposition genes with the most significant clinical value among cancer predisposition genes,53 priority would be given to BRCA screening if only one choice can be made when planning a population screening for cancer prevention. Alternatively, MMR and BRCA can be combined as one panel for the screening, with the expected outcome of identifying about twice as many BRCA pathogenic variant carriers as MMR carriers.
Different spectra of MMR variation between the general population and patients with cancer. This is reflected by the low overlap of MMR variants between the two cohorts, and by the fact that most of the pathogenic variants in the general population were in MSH6 and PMS2, whereas those in patients with cancer were in MLH1 and MSH2. This implies that the MMR variation data derived from patients with cancer cannot be directly applied as the standard reference to judge MMR variation in the general population, as the former reflects the MMR pathogenic variation enriched in the cancer cohort whereas the latter reflects the genetic variation distributed in the general population.
Highly ethnic-specific MMR variation in the general population. This is reflected by the presence of 68.2% of novel MMR variants in the Chinese population (table 1C). A similar situation in BRCA variation was also present in the general Chinese population48 and may also exist in other ethnic populations. This feature underscores the limitation of the MMR data currently available, as they were mostly derived from Caucasian populations, and highlights the need to collect variation information in cancer predisposition genes from different ethnic populations.
The challenge of classifying ethnic-specific MMR variants. Over 66% of Chinese MMR variants remain unclassified. It is difficult to classify the ethnic-specific variants and identify the potential ethnic-specific pathogenic variants due to the lack of clinical evidence, resources, expertise and references available in existing MMR databases. Substantial efforts need to be made to improve the situation.
Human MMR system is not under obvious evolutionary selection. Our study demonstrated that MMR genes in the Chinese population were neutral, without obvious positive or negative selection. This is in contrast to the situation in human BRCA1, in which strong positive selection is present.54 It is interesting to note that evolutionary selection can act differentially in different DNA damage repair genes/pathways for better fitness.
Potential value of MMR non-coding variants. Of the 17 687 variants identified in MMR genes, 16 521 (93.4%) were in non-coding regions, mostly in introns, and 12 055 (68.2%) were not present in dbSNP. Their abundance highlights the highly variable nature of MMR non-coding regions, and further exploration of their potential clinical relevance is warranted.
Certain limitations are present in our study, such as the lack of sex and age information in the variation data, as these were not available in the original data sources. For the variants detected only in single individuals, the possibility exists that certain "singleton" MMR variants were generated by sequencing errors instead of being true genetic variants. In addition, the actual prevalence of MMR pathogenic variants could be higher than observed, as the genomic data used in our study were mainly collected as short reads from next-generation sequencing platforms, which are sensitive in detecting single-base and small indel variants but lack power to detect large structural variations. Further functional tests of the abundant MMR variants will help to identify the driver variants contributing to the oncogenic process.55
In summary, our study provides a populational view for MMR variation in an ethnic human population, a scientific basis in planning population screening for MMR-related cancer prevention, and a reference resource for biological study of human MMR variation and MMR-related clinical applications.
Patient consent for publication
The MMR-targeted sequencing of Macau Chinese individuals was approved by the University of Macau Institutional Review Board (BESRE17‐APP014‐FHS).
We thank the late Dr Henry Lynch for his inspiration and encouragement in studying MMR-related cancer in the Chinese population. We thank the ‘SG10K_Pilot Investigators’ for providing the SG10K_Pilot data (EGAD00001005337). The data from the ‘SG10K_Pilot Study’ reported here were obtained from EGA. We are also thankful to the Information and Communication Technology Office (ICTO), University of Macau for providing the High-Performance Computing Cluster (HPCC) resource and facilities for the study.
LZ and ZQ contributed equally.
Contributors LZ: method development, analysis, data interpretation, draft manuscript; LZ, ZQ, HT, BT, XWu, XWang, LW, YR, GM, JL, BZ, JSC: data acquisition, data interpretation; SMW: conception, design, analysis, data interpretation, manuscript revision and funding.
Funding This work was funded by Macau Science and Technology Development Fund (085/2017/A2, 0077/2019/AMJ), University of Macau (SRG2017-00097-FHS, MYRG2019-00018-FHS), Faculty of Health Sciences, University of Macau (FHSIG/SW/0007/2020P, Startup fund) (SMW).
Disclaimer This manuscript was not prepared in collaboration with the ‘SG10K_Pilot Study’ and does not necessarily reflect the opinions or views of the ‘SG10K_Pilot Study’.
Competing interests None declared.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.