There is nothing as mysterious as the unknown. This is also true in genetics. For this reason, scientists sequenced the human genome more than a decade ago.1 ,2 The aims of the Human Genome Project were to gain insights into the organisation of our genome, but also to understand the role of genetic variation in human diseases and other traits. We have made tremendous progress in assigning functions to each of the ∼3.3 billion nucleotides that constitute our genetic code, although much work remains.3 By comparing our genome sequence with the sequences of other species, we are also starting to learn why we, humans, are different. And by analysing the genome sequences of different human populations, we are beginning to unravel how our genome impacts our phenotypes, including our risk of developing diseases. In this article, I briefly review the types of segregating genetic variation detected in the human genome, with an emphasis on the characterisation of rare and low-frequency sequence variants (figure 1). I arbitrarily define variants with a minor allele frequency (MAF) <0.1% as rare, whereas low-frequency and common variants have MAF of 0.1%–1% and >1%, respectively. My main aim is to draw conclusions from our early successes in order to guide the design of better studies to find genetic associations between rare or low-frequency variants and human complex phenotypes. Although they are clearly important, I do not discuss the role of de novo or somatic mutations in human phenotypical variation, nor do I extensively describe the different statistical methods specific to the analysis of rare variants. These topics have recently been discussed in excellent review articles.4–7
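As a minimal illustration, the frequency classes defined above can be written as a short Python function. The function name and the handling of the exact boundary values (0.1% and 1% are both counted as low-frequency) are assumptions made for this sketch and are not part of any published tool.

```python
def classify_variant(maf: float) -> str:
    """Assign a frequency class using the thresholds defined in this review.

    maf is the minor allele frequency expressed as a proportion (0.001 = 0.1%).
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie between 0 and 0.5")
    if maf < 0.001:       # MAF < 0.1%
        return "rare"
    if maf <= 0.01:       # MAF of 0.1%-1%
        return "low-frequency"
    return "common"       # MAF > 1%


for maf in (0.0005, 0.005, 0.05):
    print(f"MAF {maf:.4f}: {classify_variant(maf)}")
```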
The human genetic variation that we (think we) understand
Over the last 40 years, positional cloning, linkage studies and DNA sequencing allowed investigators to identify hundreds of mutations responsible for rare human diseases that follow Mendel's laws of inheritance. These mutations, along with the corresponding genotype–phenotype correlations, are extremely well documented in the National Center for Biotechnology Information's Online Mendelian Inheritance in Man database (OMIM; http://www.omim.org/). Thanks to the development of next-generation DNA sequencing (NGS) technologies,8 Mendelian genetics continues to be at the forefront of research, with weekly reports of new genes mutated in rare human disorders or syndromes. In particular, whole-exome sequencing (WES) makes it possible to identify aetiological mutations for extremely rare diseases even in the absence of pedigrees, the need for which is a major limitation of the linkage approach.9
Until recently, the genetic causes of common human diseases (eg, diabetes, myocardial infarction) and other complex traits (eg, height, blood cholesterol levels) also remained a mystery. The seminal theoretical work by Fisher, published in 1918, predicted what geneticists should be looking for: a large number of genetic variants, each with a very small effect on phenotypes.10 But it took ∼90 years before we could combine conclusions from ground-breaking work on the patterns of common genetic variation in the human genome11–16 with new genome-wide genotyping technologies to tackle complex trait genetics. We have now identified genetic associations between thousands of ‘common’ bi-allelic SNPs and human phenotypes.17 ,18 These genome-wide association studies (GWAS) have yielded new insights into human biology in health and diseases. Translating these GWAS discoveries is the next frontier. With novel tools (eg, TALEN and CRISPR/Cas9 genome editing methods19 ,20) and resources (eg, epigenomic data from the ENCODE and Roadmap Epigenomic Projects21 ,22 and transcriptomic data from FANTOM523 ,24) available, wet-lab experimentalists can now make significant progress to understand the molecular mechanisms that drive human phenotypical variation.
When comparing two human genomes, most of the differences in terms of nucleotide changes reside not at SNPs but in large (>1 kilobase) structural variants, such as insertions, deletions and duplications.25 Improvements in array and sequencing technologies helped to generate accurate, single-base resolution maps of these copy number variants (CNVs).25 ,26 Excitement and expectations were high regarding the potential influence of CNVs on human phenotypes. Investigators identified associations of CNVs with complex human diseases and traits, including neurocognitive disorders,27–29 Crohn's disease30 and body mass index,31 ,32 but the number of such associations remained low. This is, in part, due to the technical difficulty in obtaining accurate CNV genotypes in large populations.33 In a study that tested 3432 CNVs for association with eight common human diseases in ∼19 000 participants, the Wellcome Trust Case Control Consortium did not report novel associations.33 An important conclusion of that study, however, is that most common CNVs are in linkage disequilibrium (LD) with SNPs normally surveyed by genotyping arrays.33 Therefore, current large meta-analyses of GWAS results indirectly test the effect of a large subset of common CNVs on human phenotypical variation. Although there is probably more than meets the eye, and this may change as we explore our genome further, the current role of common structural variants in complex human diseases and traits appears limited.
Rare and low-frequency variants: we know they exist, but we don't really understand them (yet)
One of the main conclusions of the 1000 Genomes Project is that most of the genetic variation in our genome is rare and private to the different human populations.15 ,16 Despite remaining challenges (table 1), studying rare and low-frequency variants is the new hype in human genetics for at least three reasons. First, despite its success in finding thousands of SNP associations, the GWAS approach has not yet identified most of the genetic variation that contributes to disease risk or trait variation, the so-called missing heritability paradox.34 Although theoretical and empirical analyses have determined that a large fraction of the heritability is not missing but, in fact, hidden in GWAS results,35 ,36 it is also true that rare and low-frequency variants, which are usually not tested by genome-wide genotyping arrays, could influence phenotypes. The identification of rare coding variants can also help pinpoint which genes are causal within GWAS loci. Second, early findings in rare variant genetics suggested that this class of variation might have large effects on phenotypes.37 This is intuitive: the frequency of strongly detrimental alleles should be controlled by purifying selection, and it is also consistent with the observation that most common SNPs identified by GWAS have weak effects. The poster child example of this rationale is the identification of low-frequency missense variants in PCSK9 that are associated with low low-density lipoprotein (LDL)-cholesterol levels and reduced coronary heart disease risk.38 This finding led to the development of a new class of therapeutics to treat patients with hypercholesterolaemia, paving the way for similar approaches following genetic discoveries.39 As we will discuss below, it seems that the large phenotypical effect observed for PCSK9 coding variants is more the exception than the rule. In fact, the weak phenotypical effect observed for many rare variants is consistent with early population genetic work.
By considering mutations that cause Mendelian diseases, human–chimpanzee divergence and DNA sequence data in a large number of individuals, investigators showed that most rare missense mutations are deleterious in humans and may therefore influence complex human phenotypes. However, the estimated selection coefficients that best fit the data are small, suggesting that most rare deleterious missense variants have small effects on fitness.40 And finally, from a more practical point of view, rare variant experiments in large DNA collections are only now becoming possible with NGS technologies. It does remain expensive and analytically complicated, but NGS is mature. Several large-scale sequencing projects are now ongoing or completed, such as the Exome Sequence Project that surveyed genetic variation in the exome of 6515 cohort participants.41
Initially, DNA resequencing efforts to find rare variants were targeted to candidate genes. These genes were selected based on previous molecular, cellular or genetic (Mendelian diseases, GWAS results) knowledge. Such an approach proved successful for blood lipid traits,38 ,42–44 but also for other phenotypes such as type 1 and 2 diabetes,45 ,46 fetal haemoglobin levels47 and age-related macular degeneration (AMD).48 ,49 A main challenge when sequencing excellent candidate genes pertains to distinguishing pathological from neutral mutations. Two recent studies sequenced genes implicated in diabetes and cardiomyopathies and identified a large number of non-synonymous variants in healthy individuals, highlighting the difficulty in using this genetic information to develop prognostic tests.50 ,51 Functionally validating the impact of the DNA sequence variants identified remains a priority, and a series of guidelines to demonstrate causality in genotype–phenotype analyses was recently proposed.52
Except for neurocognitive disorders, for which NGS has implicated de novo variants,6 there are currently few examples of WES or whole-genome sequencing (WGS) experiments that have identified rare or low-frequency variants associated with complex human diseases or traits. WES of 91 patients with cystic fibrosis (a monogenic disease) identified missense variants in DCTN4 that are associated with resistance to Pseudomonas aeruginosa infections (a complex trait).53 WGS in 962 participants did not identify new genetic associations with high-density lipoprotein-cholesterol,54 whereas WES in 2005 individuals found rare variants in one gene, PNPLA5, that are associated with LDL-cholesterol.55 In the cystic fibrosis and LDL-cholesterol studies, 91 and 554 individuals were selected from the extremes of bacterial resistance and LDL-cholesterol levels, respectively. Under an additive genetic effect model, this ‘extreme’ study design increases statistical power to find variants while limiting the number of samples to sequence.56
There is one example where WGS has been successful for common human diseases. The Iceland-based deCODE genetics company has reported several associations between strong-effect rare/low-frequency variants identified by WGS and diseases. These include variants in TREM2 and APP associated with Alzheimer's disease,57 ,58 a nonsense variant in LGR4 with osteoporosis,59 a variant in C3 with AMD60 and several variants with type 2 diabetes.61 Importantly, other investigators have replicated some of the associations with Alzheimer's disease, AMD and type 2 diabetes.48 ,49 ,62–64 For all these findings, deCODE's approach was similar: they identified genetic variation in the Icelandic population by WGS of ∼2000 participants. Then, they imputed the identified genetic variants using long-phase haplotyping methodology in ∼90 000 participants genotyped on GWAS-type arrays. Finally, they used the extensive genealogy of this population to infer genotypes in >250 000 individuals. Although the sample size of these studies is very large, the number of cases remains in the ‘normal’ range for association studies: for instance, there were 1143 and 11 114 cases in the recent AMD and type 2 diabetes studies, respectively.60 ,61 The high control-to-case ratio (45:1 for AMD, 24:1 for type 2 diabetes) improves power, although the gain plateaus as the number of controls increases. deCODE's successes are also explained by the phenotypical, genetic and environmental homogeneity of its participants, which minimises potential confounders. This might be particularly important for association studies of rare and low-frequency variants.65 ,66 Further supporting the importance of working with homogeneous populations, a WES experiment in large families identified a rare missense variant in PLD3 that is associated with late-onset Alzheimer's disease.67 The deCODE studies highlight that population isolates and large pedigrees might be particularly useful for rare and low-frequency variant studies.
Furthermore, imputing variants into already genotyped samples might be a powerful approach to minimise sequence costs while maximising power. Recently, we used a similar strategy—WES in 761 African-Americans and imputation in ∼13 000 genotyped African-Americans—to find new associations with blood cell phenotypes.68
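The diminishing return from adding controls, noted above for the deCODE studies, can be illustrated with the effective sample size, a commonly used approximation for case-control designs: N_eff = 4 / (1/N_cases + 1/N_controls). The Python sketch below is not taken from the cited studies. The function names are hypothetical, and the approximation ignores allele frequency and effect size, so it should be read as a rough guide rather than a formal power calculation.

```python
def effective_sample_size(n_cases: int, n_controls: int) -> float:
    """Approximate effective sample size of a case-control study."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def fraction_of_maximum(ratio: float) -> float:
    """Effective sample size relative to the limit of unlimited controls.

    With `ratio` controls per case, N_eff = 4 * n_cases * ratio / (1 + ratio),
    and the limit for an infinite number of controls is 4 * n_cases.
    """
    return ratio / (1.0 + ratio)


for ratio in (1, 4, 24, 45):
    print(f"{ratio}:1 controls per case -> "
          f"{fraction_of_maximum(ratio):.3f} of the maximum")
```

Under this approximation, 24 and 45 controls per case already provide about 96% and 98% of the effective sample size that an unlimited number of controls would give, compared with 50% for a 1:1 design.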
Sequencing by direct genotyping
One of the conclusions from the early large-scale NGS experiments is that we need large sample sizes to find new genetic associations with rare or low-frequency variants. The variants identified so far have large effect sizes (often OR >2), but we have found only a handful despite having sequenced large cohorts with different complex phenotypes available. In retrospect, we have probably not yet performed well-powered NGS experiments: we found large-effect variants because we only had power to find such variants. Based on our few findings, it seems likely that most rare or low-frequency variants will have modest-to-weak effects on phenotypes. But how can we test rare/low-frequency variants in tens of thousands of samples?
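To make the sample size argument concrete, the following Python sketch computes how many carriers of a variant one would expect to observe in a cohort. It assumes Hardy-Weinberg proportions, an assumption made here purely for illustration, and the function names are hypothetical.

```python
def carrier_probability(maf: float) -> float:
    """Probability that a person carries at least one copy of the minor allele,
    assuming Hardy-Weinberg proportions."""
    return 1.0 - (1.0 - maf) ** 2


def expected_carriers(n_individuals: int, maf: float) -> float:
    """Expected number of carriers in a cohort of n_individuals."""
    return n_individuals * carrier_probability(maf)


# A variant at the rare/low-frequency boundary used in this review (MAF = 0.1%)
for n in (2_000, 50_000):
    print(f"{n} individuals: {expected_carriers(n, 0.001):.2f} expected carriers")
```

For a variant with a MAF of 0.1%, a cohort of 2000 individuals is expected to contain roughly 4 carriers, whereas 50 000 individuals yield roughly 100, which shows why association testing of such variants requires very large collections.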
Exome arrays were designed precisely to answer this need, that is, to develop a tool that would allow large-scale testing of coding variation in very large sample sizes at moderate costs (<10% of what WES costs if we include analysis time). To design the Illumina Infinium HumanExome Beadchip, investigators combined genetic variation identified by WES or WGS of ∼12 000 individuals and selected ∼250 000 variants for the exome array (http://genome.sph.umich.edu/wiki/Exome_Chip_Design). These variants have been seen at least three times in two different studies and are highly enriched for protein-altering functions (missense, nonsense, splice site). Affymetrix has also generated a similar exome array. Exome chips are convenient because of their simplicity, but also have certain limitations. First, many coding and all non-coding rare variants are not tested by exome arrays. For an exhaustive analysis of this class of genetic variation, direct DNA sequencing remains necessary. Second, exome chips might not capture coding variation equally well in different populations. Most of the sequence data used to generate the genetic variation catalogue for the exome chip was from individuals of European ancestry. Thus, exome chip experiments in other populations might miss a large fraction of the coding variation that is ancestry-specific or population-specific. As a dramatic example, we recently sequenced the exome of 164 African-Americans who were also genotyped on the Illumina exome chip: 67% of the coding variation (mostly very rare, however) was not surveyed by the exome array (Ken Sin Lo and GL, unpublished). This is an important caveat to remember in deciding between NGS and exome chip genotyping for experiments in non-European ancestry populations, especially because LD will not be helpful to tag variants at such low MAF.
Genetic discovery experiments based on the exome array approach already have some successes (table 2). The first report focused on insulin processing and secretion in individuals from Finland.64 The authors identified four missense and one nonsense variants strongly associated with these insulin traits. Two of these variants fell within, but were independent from, GWAS signals for the same phenotypes; these low-frequency variants implicate SGSM2 and MADD as causal genes for insulin secretion (table 2). The three remaining variants did not overlap with GWAS loci for insulin indexes. This study identified the same variant in PAM (p.Asp563Gly) that was found to be associated with type 2 diabetes risk by the deCODE group.61 Blood lipid traits were also analysed in large populations genotyped on exome arrays, leading to the identification of coding variation at five loci (table 2).69 ,70 A low-frequency variant in TM6SF2 (p.Glu167Lys) is associated with total cholesterol levels and alanine transaminase (a marker of liver injury), as well as two related clinical endpoints: myocardial infarction and non-alcoholic fatty liver disease.70 ,71 This TM6SF2 variant explains the GWAS signal for these phenotypes at the locus. Finally, we used the exome chip to identify coding variants associated with blood cell phenotypes in ∼30 000 Europeans or individuals of European descent.72 We reported the first erythropoietin variant associated with haemoglobin and haematocrit levels, a rare missense variant in the thrombocytopenia gene TUBB1 associated with platelet count, and a collection of eight missense variants in the chemokine receptor gene CXCR2 associated with white blood cell counts (table 2). 
We further demonstrated that a CXCR2 frameshift mutation segregating in a family is responsible for congenital neutropenia.72 Several large consortia, with access to exome chip genotype data for hundreds of thousands of individuals, are in progress and should yield many additional rare and low-frequency coding variants associated with human phenotypes.
And there is the part of our genome that we don't understand: repetitive sequences
We often present NGS methods as a solution to all our genetic problems given their unprecedented capacity to generate DNA sequences. But we forget that a non-negligible fraction of our genome (repetitive DNA sequences that cover over half of the human genome) is largely refractory to this technology. Repeats are nearly identical segments of DNA that can be found at several locations and on different chromosomes. They can be short (1–2 bp motifs) or long (several kilobases). The transposable element Alu is our most abundant repetitive sequence, representing ∼11% of the human genome.1 ,2 Variation in the number of repeats at specific loci has been linked to many human pathologies, most notably the expansion of triplet nucleotides in Huntington's disease, fragile X syndrome, myotonic dystrophy and other disorders.73 From an NGS perspective, repeats are problematic because the corresponding sequence reads are usually too short to be mapped unambiguously. This introduces sequence errors and difficulties in interpreting results.74
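The mapping ambiguity described above can be illustrated with a toy example in Python. The reference sequence is invented for this sketch, and exact string matching stands in for the far more sophisticated alignment algorithms used by real NGS read mappers; the function name is hypothetical.

```python
def exact_match_positions(reference: str, read: str) -> list:
    """Return every 0-based position at which `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Resume one base further on so that overlapping matches are also found
        start = reference.find(read, start + 1)
    return positions


# Invented reference: six CAG triplets flanked by unique sequence
reference = "ACGT" + "CAG" * 6 + "TTGA"

short_read = "CAG" * 2                # lies entirely within the repeat
long_read = "GT" + "CAG" * 6 + "TT"   # spans the repeat and both unique flanks

print(exact_match_positions(reference, short_read))  # several candidate positions
print(exact_match_positions(reference, long_read))   # a single position
```

The short read matches at five different positions within the repeat, so its true origin cannot be determined, whereas the read that spans the whole repeat and reaches the unique flanking sequence maps to a single position.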
Medullary cystic kidney disease type 1 (MCKD1) is a Mendelian disease that was mapped to a two-megabase interval on chromosome 1 by linkage studies more than a decade ago. More recently, investigators used WES and WGS but did not find mutations that segregated perfectly with disease status in affected pedigrees. They eventually used ‘old-fashioned’ positional cloning, capillary sequencing and de novo assembly methods to discover that MCKD1 is caused by a cytosine insertion in one repeat of a variable number tandem repeat (VNTR) in the MUC1 gene.75 The MUC1 VNTR, which is very guanine–cytosine-rich, could not be sequenced by WES and was under-represented in the WGS data. The identification of the causal mutation for MCKD1 serves as an illustrative example of the challenges of analysing repetitive DNA sequences by NGS. Whether such repeat sequence variation (common or rare) could also impact complex trait genetics remains to be tested.
Driven by the sequencing of the human genome and technological advancements, human geneticists have made great progress in the identification of genetic variation that causes simple and complex human diseases or that influences other human phenotypes. The new excitement in the field is in the characterisation of rare and low-frequency variants, in part because such variants might have larger phenotypical effects and might therefore be more clinically actionable than GWAS SNPs in the context of personalised medicine and drug development. Although there are clearly rare/low-frequency large-effect variants, their number is likely to be small given insights from the completed studies. Large sample sizes are needed for comprehensive studies of rare and low-frequency variants. Other challenges include the development of new statistical methods to test association between functionally related groups of variants (gene-based, but could also be pathway-based, promoter-based or enhancer-based) as well as to explore the contribution of rare non-coding genetic variation to human phenotypical variation. Finally, because rare variation is mostly population-specific, it will be important to improve methods to correct for confounders such as population stratification because existing approaches are not appropriate.65 ,66 This is particularly important to avoid some of the early pitfalls of the common variant association testing literature.76 The coming years will mark another chapter in the history of the exploration of our genome. It will be interesting to see how this rare/low-frequency variant adventure contrasts with the previous chapters on positional cloning, capillary sequencing and GWAS. It will also be interesting to see how it may provide ideas and tools to study, in the future, repetitive DNA sequences as they relate to human phenotypical variation.
I would like to thank Chris Cotsapas and Ekat Kritikou, as well as all the members of my laboratory, for suggestions and comments on an early version of this manuscript. I apologise to all my colleagues whose work could not be cited because of space constraints. Work in my laboratory was funded by the Canadian Institutes of Health Research (#243400), the Canada Research Chair programme, Genome Canada/Genome Quebec, the Doris Duke Charitable Foundation (#2012126) and the Montreal Heart Institute Foundation.