Abstract
Recent advances in next-generation sequencing technologies have brought a paradigm shift in how medical researchers investigate both rare and common human disorders. The ability to generate genome-wide sequencing data cost-effectively, with deep coverage and in a short time frame, is replacing approaches that focus on specific regions for gene discovery and clinical testing. While whole genome sequencing remains prohibitively expensive for most applications, exome sequencing, a technique that focuses only on the protein-coding portion of the genome, places many advantages of the emerging technologies into researchers' hands. Recent successes using this technology have uncovered genetic defects with a limited number of probands regardless of shared genetic heritage, and are changing our approach to Mendelian disorders, for which all causative variants, genes and their relation to phenotype will soon be uncovered. The expectation is that, in the very near future, this technology will enable us to identify all the variants in an individual's personal genome and, in particular, clinically relevant alleles. Beyond this, whole genome sequencing is also expected to bring a major shift in clinical practice in terms of diagnosis and understanding of diseases, ultimately enabling personalised medicine based on one's genome. This paper provides an overview of the current and future use of next-generation sequencing as it relates to whole exome sequencing in human disease by focusing on the technical capabilities, limitations and ethical issues associated with this technology in the field of genetics and human disease.
- Molecular genetics
- cancer: CNS
- paediatric oncology
Introduction
Discoveries made in the 20th century have helped to completely reshape all fields of biomedical studies as we know them. The revolution we are currently witnessing was triggered by the discovery of the DNA double helix in 1953 [1, 2], which has enabled major advances in genetics and heredity. Advances in knowledge have often been driven by the advent of new technologies. PCR was discovered in 1983 [3] and revolutionised our approach to the study of DNA, and that in turn revolutionised the molecular analysis of mammalian genes. In 1977 two landmark articles describing methods for DNA sequencing were published [4, 5]. The approach reported by Sanger and colleagues was further refined and commercialised, leading to its dissemination throughout the research community and, ultimately, into clinical diagnostics. In an industrial high-throughput configuration, Sanger technology was then used in the sequencing of the first human genome, which was completed in 2001 through the Human Genome Project, a 13-year effort with an estimated cost of $2.7 billion [6–8]. In 2008, by comparison, a human genome was sequenced over a 5-month period for approximately $1.5 million [9] and now, in 2011, sequencing a whole genome takes a few days and is projected soon to cost less than $10 000. The latter accomplishment was made possible by the commercial launch of the first massively parallel pyrosequencing platform in 2005, which ushered in the era of high-throughput genomic analysis now referred to as ‘next-generation sequencing’ (NGS). The time and cost needed are expected to fall even further in the very near future, making these technologies available to researchers without large budgets. Vast amounts of genotype and phenotype data are now being generated by a growing number of research efforts. Interpreting all these new data and translating the findings to practical healthcare is a challenge.
This review focuses on what whole exome sequencing (WES) using NGS can teach us about human disease, starting with single gene disorders and moving on to more complex genetic disorders including complex traits and cancer. We review the current state of the technology, its current and future practical uses, the reasons to use WES instead of whole genome sequencing (WGS) and some of the ethical dilemmas arising from the impact of the results on society and on clinical practice.
Next-generation sequencing (NGS) technologies and experimental approaches for whole exome sequencing (WES)
NGS platforms share a common technological feature—namely, massively parallel sequencing of clonally amplified or single DNA molecules that are spatially separated in a flow cell. This design is a paradigm shift from that of Sanger sequencing and has allowed scaling-up by orders of magnitude. In NGS, sequencing is performed by repeated cycles of polymerase-mediated nucleotide extensions or, in the case of ABI SOLiD, by iterative cycles of oligonucleotide ligation. As a massively parallel process, NGS generates hundreds of megabases to gigabases of nucleotide sequence from a single instrument run, depending on the platform. Targeted sequencing approaches have the general advantage of increased sequence coverage of regions of interest—such as coding exons of genes—at lower cost and higher throughput compared with random shotgun sequencing methods.9–12 Most large-scale methods for targeted sequencing use a variation of a hybrid selection approach. Complementary nucleic acid ‘baits’ are used to ‘fish’ for regions of interest in the total pool of nucleic acids, which can be DNA12–15 or RNA.16 Any subset of the genome can be targeted, including exons, non-coding RNAs, highly conserved regions of the genome, disease-associated LD blocks or other regions of interest. Because the exome represents only approximately 1% of the genome or about 30 Mb, vastly higher sequence coverage can be readily achieved using second-generation sequencing platforms with considerably less raw sequence and cost than WGS. For example, whereas 90 Gb of sequence is required to obtain 30-fold average coverage of the genome, 75-fold average coverage is achieved for the exome with only 3 Gb of sequence using the current state-of-the-art platforms for targeting.17 18 However, there are inefficiencies in the targeting process. 
For example, uneven capture efficiency across exons can result in exons with low sequence coverage, and off-target hybridisation means that at least 20% of reads come from genomic DNA outside the exome. In addition, exome capture is not complete. The probes in sequence capture methods are designed using gene annotation databases such as the Consensus Coding Sequence (CCDS) database and the RefSeq database. Therefore, unknown or yet-to-be-annotated exons, evolutionarily conserved non-coding regions and regulatory sequences such as enhancers or promoters are not typically captured. Partly to address these issues of coverage, the latest commercially available capture kits provide nearly complete coverage of the well-annotated genes and also allow the user to add custom content by designing capture probes targeted to additional regions of interest such as promoters or highly conserved sequences. Newer kits have also expanded the regions captured to include micro-RNA sites and untranslated regions of genes, thus increasing the regions captured from ∼30 Mb to as much as 62 Mb (table 1).
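The coverage figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is ours, the genome size of 3 Gb is the value implied by 90 Gb giving 30-fold coverage, and the on-target fraction of 0.75 is an assumption chosen so that 3 Gb of raw sequence yields the quoted 75-fold exome coverage (it is compatible with, but not stated by, the figure of at least 20% off-target reads).

```python
def mean_coverage(raw_bases: float, target_size: float,
                  on_target_fraction: float = 1.0) -> float:
    """Average fold coverage of a target, given the raw sequence output.

    raw_bases          -- total bases sequenced
    target_size        -- size of the region of interest, in bases
    on_target_fraction -- share of sequenced bases that fall on the target
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if not 0.0 <= on_target_fraction <= 1.0:
        raise ValueError("on_target_fraction must lie between 0 and 1")
    return raw_bases * on_target_fraction / target_size


GENOME_SIZE = 3e9   # ~3 Gb, implied by 90 Gb / 30-fold (assumption)
EXOME_SIZE = 30e6   # ~30 Mb, as quoted in the text

# Whole genome: every sequenced base counts towards the target.
wgs_coverage = mean_coverage(90e9, GENOME_SIZE)

# Exome: assume 75% of bases land on target (hypothetical value).
wes_coverage = mean_coverage(3e9, EXOME_SIZE, on_target_fraction=0.75)

# With perfect capture the same 3 Gb would give a higher figure.
wes_ideal = mean_coverage(3e9, EXOME_SIZE)

print(f"WGS: {wgs_coverage:.0f}x, WES: {wes_coverage:.0f}x, "
      f"WES with perfect capture: {wes_ideal:.0f}x")
```

The gap between the last two numbers is one way to see the cost of off-target hybridisation: capture inefficiency lowers average depth but does not, by itself, explain the unevenness between exons.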
An important consideration is that NGS technologies have higher base calling error rates than Sanger sequencing, although this can be remedied to some extent by increasing the depth of sequencing coverage to ensure minimal ‘false calls’.19 This makes the resequencing of mutant or variant genes using conventional sequencing techniques important for validation and increases the cost of the approach. All of these inefficiencies are likely to be ameliorated as sequencing and capture technology continue to improve. Importantly, the higher coverage of the exome that can be affordably achieved for a large number of samples makes exome sequencing highly suitable for mutation discovery and its use is becoming increasingly routine.
WES in human disease
“Why me?” “Why my child?” “What could I have done to avoid this?” “What can I do to be cured or to get better?” These questions routinely face clinicians caring for patients with a given disease, and they are even questions we ask ourselves as medical knowledge expands with the rapid advances in genome sequencing. They pertain to any human disease as all have a genetic component—major or minor—or, for some, they simply reflect the wish to know what genetic information they are born with. Until 2008 such questions usually remained unanswered because, even for some Mendelian disorders, we did not know how to identify causative mutations within our genome. Indeed, our individual genomes contain variants which may protect us against or increase our susceptibility to our environment and the multiple stressors we encounter during our lifespan. Knowing these variants can perhaps allow us to better prepare or to avoid the negative impacts they might have on our health, lifespan and offspring. In the short years since the first commercial platform became available, NGS has dramatically accelerated multiple areas of genomics research, enabling experiments that previously were not technically feasible or affordable. In this paper we describe the major ongoing applications of NGS as they pertain to WES.
Genetic variants identified using WES
Genetic variants induce a phenotype that can vary between individuals in penetrance or physiological effect and may depend on (1) environmental factors; (2) modifier genes and/or the epigenome; and (3) the additive/synergistic effect from another genetic variant (digenic inheritance).20 High-penetrance variants induce a strong physiological effect and thus these alleles have usually been identified as causative for Mendelian monogenic disorders using linkage studies in families (see below). Low-penetrance variants have a weak phenotype and causative alleles are typically identified in large case/control cohorts as part of the study of complex trait disorders. Common high-penetrance disease variants are rare as they would normally be eliminated from breeding populations, except in cases of balancing selection where heterozygotes have an advantage over homozygotes (eg, sickle cell disease and resistance to cerebral malaria). The distinction between monogenic and complex diseases is therefore operational as all genetic variants are de facto transmitted in a Mendelian fashion and thus amenable to discovery using WES, giving this technique far wider applicability than simply the study of rare monogenic disorders (figure 1).
WES in characterising monogenic (Mendelian) disorders
Uncovering genetic defects underlying monogenic inherited disorders is one of the most obvious applications of WES. Single-gene disorders, while individually rare, are in aggregate numerous and have an enormous impact on the well-being of affected patients. To date, the responsible gene remains unknown for more than 3000 Mendelian disorders. Although rates of spontaneous mutation in the human genome have been estimated in various ways, it is clear that worldwide the entire human gene repertoire is bombarded with new pathogenic alleles each year.21 The number of known mutations in human nuclear genes that underlie or are associated with human inherited disease now exceeds 100 000 in more than 3700 different genes (Human Gene Mutation Database). However, for a variety of reasons this figure probably represents only a small fraction of the clinically relevant genetic variants in the human genome.
NGS brings new ways of addressing monogenic disorders. Classical strategies involved using linkage analysis in families with known shared genetic heritage, identifying candidate genomic regions enclosing the gene with the causative mutation, narrowing the interval whenever possible with additional families/probands and thereafter either implementing a candidate gene approach or systematically sequencing the genes located within the interval. The advent of single nucleotide polymorphism (SNP) arrays, which can identify regions of homozygosity within a genome, helped significantly to hasten the linkage analysis by narrowing the regions of interest for further directed sequencing. However, these approaches are costly and time-consuming and their success in identifying causative genetic variants has been variable, mainly due to the small numbers of available affected individuals for a given Mendelian disease and possibly also due to locus heterogeneity.22–25 Deep resequencing of all human genes for discovery of allelic variants could potentially identify the gene underlying any given rare monogenic disease where a shared genetic heritage is not readily available.11
Protein-coding genes constitute only approximately 1% of the human genome but harbour about 85% of the mutations with large effects on disease-related traits. Indeed, most Mendelian disorders are caused by exonic mutations or splice-site mutations that change the amino acid sequence of the affected gene. In contrast to the more laborious approach of SNP homozygosity mapping, the exome sequencing approach is faster, does not depend on shared allelic heritage and can be done in the presence of allelic heterogeneity. Instead, its success depends on the mutation being present in the captured portion of the genome and on our ability to identify it as a pathogenic variant among the many thousands of new variants detected in each exome (the ‘background noise’). Strategies to identify these mutations using exome sequencing generally rely on certain assumptions. First, homozygous or heterozygous mutations in only a single gene are required to cause the disease and these mutations will be extremely rare (eg, only present in affected individuals). Second, these mutations have a large effect size, are highly penetrant and, as such, are assumed to affect the protein sequence (ie, non-synonymous SNPs, insertions/deletions or splice-site mutations). The main strategy employed to identify causative mutations is therefore to find all variants in the exome and apply various filters based on the assumptions above. Additional filters are also required to remove false positives caused by systematic, sequencing and misalignment errors. However, the development of bioinformatics analysis tools and the availability and rapidly decreasing cost of NGS technology render exome sequencing simpler and faster than homozygosity mapping. 
The two approaches nevertheless remain complementary: homozygosity mapping can focus attention rapidly on the variants most likely to be causative, which can now be identified in record time and with a small number of affected individuals using WES, a cost-effective, reproducible and robust strategy.
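The filtering strategy described above can be sketched in code. This is a minimal illustration under stated assumptions, not a description of any published pipeline: the `Variant` record, the consequence labels, the depth threshold and the gene names are all hypothetical, and the final step of intersecting candidate genes across affected individuals is one common way of applying the single-gene assumption rather than something prescribed in the text.

```python
from dataclasses import dataclass

# Consequence classes assumed to alter the protein (labels are hypothetical).
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift",
                    "inframe_indel", "splice_site"}


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str
    genotype: str             # "het" or "hom"
    depth: int                # read depth at the site
    known_polymorphism: bool  # present in public databases or control exomes


def candidate_variants(variants, min_depth=10):
    """Keep well-covered, protein-altering variants not seen as polymorphisms."""
    return [v for v in variants
            if v.depth >= min_depth
            and v.consequence in PROTEIN_ALTERING
            and not v.known_polymorphism]


def genes_fitting_model(variants, recessive):
    """Genes whose genotypes in one individual fit the inheritance model."""
    by_gene = {}
    for v in variants:
        by_gene.setdefault(v.gene, []).append(v)
    genes = set()
    for gene, hits in by_gene.items():
        if not recessive:
            genes.add(gene)  # dominant: a single hit is enough
        elif (any(v.genotype == "hom" for v in hits)
              or sum(v.genotype == "het" for v in hits) >= 2):
            # Two heterozygous hits are treated as a possible compound
            # heterozygote; phase is unknown, so this is only a candidate.
            genes.add(gene)
    return genes


def shared_candidate_genes(per_patient_variants, recessive, min_depth=10):
    """Genes that remain candidates in every affected individual."""
    gene_sets = [genes_fitting_model(candidate_variants(vs, min_depth), recessive)
                 for vs in per_patient_variants]
    return set.intersection(*gene_sets) if gene_sets else set()


# Toy data for two unrelated patients (gene names are placeholders).
patient_a = [Variant("GENE1", "missense", "hom", 30, False),
             Variant("GENE2", "missense", "het", 25, False),
             Variant("GENE3", "synonymous", "hom", 40, False)]
patient_b = [Variant("GENE1", "nonsense", "het", 18, False),
             Variant("GENE1", "splice_site", "het", 22, False),
             Variant("GENE4", "missense", "hom", 35, True)]

recessive_hits = shared_candidate_genes([patient_a, patient_b], recessive=True)
print(sorted(recessive_hits))
```

Because the filter works at the level of genes rather than individual variants, the two patients do not need to carry the same mutation, which is why the approach tolerates allelic heterogeneity.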
The reported successes using NGS have increased exponentially since its first application in 2008 and have frequently been achieved using a limited number of patients.26–28 Identification of genetic defects in autosomal recessive diseases was performed using either unrelated individuals10 29–34 and/or individuals from the same families35–42 and was, in some instances, coupled to homozygosity mapping.43–45 Similar success was achieved for autosomal dominant disorders.46 47 WES has even been able to identify the causative mutation in diseases with genetic and phenotypic heterogeneity.30 34 48 49 Such heterogeneity would make identification of the causative mutation very difficult, if not impossible, by traditional linkage-based approaches. In two recent publications WGS was used to identify the causative gene, but it should be noted that in both instances the investigators identified the genetic alterations in the exome.50 51 Thus, in total since 2009, more than 20 causative genes have been identified and this number is growing exponentially. With the availability of this technology, several initiatives in North America through the National Institutes of Health (NIH), Finding of Rare Disease Genes in Canada (FORGE) and Rare Disease Consortium for Autosomal Loci (RaDiCAL) have been launched which aim to characterise Mendelian disorders. The challenge now is how to validate, among the multiple variants that will be identified, the causative alteration and link it to disease and function. Moreover, it is likely that we will identify novel phenotypes for a known gene, or the reverse, as different hypomorphic mutations can lead to distinct phenotypes52–54 and identical mutations in a given gene have been shown to induce distinct phenotypes.55 In addition, in highly consanguineous probands, the compounded effect of added genetic alterations can affect the phenotype and lead to what has been identified as a novel disease.
Paradigm shift brought by WES in the identification of de novo mutations
A remarkable demonstration of how powerful WES using NGS can be in teaching us about human disease comes from a recent study that provides evidence that de novo single nucleotide variants may contribute substantially to mental retardation.56 Investigators explored a major paradox in evolutionary theory—namely, that the per-generation mutation rate in humans is high despite the allelic loss due to reduced fertility. They postulated that these de novo mutations may compensate for allele loss in common neurodevelopmental and psychiatric diseases, and explain this paradox in evolutionary genetic theory. They used a family-based WES approach to test this de novo mutation hypothesis in 10 individuals with unexplained mental retardation. Following WES of their trios (parents and proband), they identified and validated unique non-synonymous de novo mutations in nine genes. While they further establish the power of NGS/WES in identifying the genetic basis of human diseases, their findings also provide strong experimental support for a de novo paradigm; de novo point mutations of large effect sizes together with de novo copy number variation could potentially explain the majority of all cases of mental retardation in the population.56 In addition, in cases where the results from the analysis of WES are not conclusive in identifying a causative gene and/or in the scenario where only one affected individual within the family is available for studying a rare disorder, resequencing trios (parents and the patient) can help to pinpoint the causative genetic variant by excluding mutations shared with the parents.
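The trio logic described above, excluding variants shared with the parents, can be sketched as follows. The representation of a variant as a `(chromosome, position, reference, alternate)` tuple, the per-site depth dictionaries and the minimum parental depth are assumptions made for illustration; the depth check is our addition, included because a variant can appear de novo simply because a parent was poorly covered at that site.

```python
def de_novo_candidates(proband_calls, mother_calls, father_calls,
                       mother_depth, father_depth, min_parent_depth=10):
    """Variants called in the proband and confidently absent from both parents.

    *_calls -- sets of (chrom, pos, ref, alt) tuples
    *_depth -- dicts mapping (chrom, pos) to read depth in that parent
    """
    candidates = []
    for variant in sorted(proband_calls):
        if variant in mother_calls or variant in father_calls:
            continue  # inherited, so not de novo
        site = variant[:2]
        if (mother_depth.get(site, 0) < min_parent_depth
                or father_depth.get(site, 0) < min_parent_depth):
            continue  # absence in a parent is not trustworthy at low depth
        candidates.append(variant)
    return candidates


# Toy trio (coordinates are arbitrary placeholders).
proband = {("1", 1000, "A", "G"), ("2", 2000, "C", "T"), ("3", 3000, "G", "A")}
mother = {("1", 1000, "A", "G")}
father = set()
mother_dp = {("1", 1000): 40, ("2", 2000): 35, ("3", 3000): 4}
father_dp = {("1", 1000): 38, ("2", 2000): 30, ("3", 3000): 33}

print(de_novo_candidates(proband, mother, father, mother_dp, father_dp))
```

Candidates that survive this step would still need validation by conventional sequencing, as noted earlier for NGS calls in general.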
WES in characterising complex trait disorders
Genome-wide association (GWA) studies of complex traits have been successful in identifying common variant associations but have failed to explain most of the heritability of these traits.57 The field of complex trait genetics is shifting towards the study of low-frequency (minor allele frequency (MAF) 0.01–0.05) and rare (MAF <0.01) variants, some of which are hypothesised to have larger effects. Indeed, GWA studies, which so far have focused on very common SNPs, have been completed for most common human diseases and many related traits.58 59 These studies have been designed based on the knowledge of most of the very common gene variants (MAF >∼5%) in the human genome and have identified over 500 independent strong SNP associations (p<1×10⁻⁸) (see the National Human Genome Research Institute Catalog of Published GWA Studies). However, most of these associated SNPs have very small effect sizes and the proportion of heritability explained is at best modest for most traits.57 60 Furthermore, most GWA signals have yet to be tracked to causal polymorphisms. Fine mapping and functional evaluation of these loci is an ongoing process with several successful examples indicating that the causal variants must have subtle regulatory effects. Although the systematic identification of rare variants associated with common diseases has not yet been feasible, several rare variants have nevertheless been identified that confer a substantial risk of disease. For example, autism, mental retardation, epilepsy and schizophrenia have been shown to be influenced by rare structural variants that affect genes.61 Additionally, it seems possible that some, perhaps even many, of the current GWA signals could reflect the effect of one or more rare variants that have been tagged by common variants.62 Whatever underlies the GWA signals for common diseases, it is clear that GWA studies of common variants have limited value in disease prediction.
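The frequency classes used in this section can be written down as a small helper. The thresholds are those given in the text; how the exact boundary values are assigned (0.01 and 0.05 both counted as low-frequency here) is our assumption, as is the function name.

```python
def frequency_class(maf: float) -> str:
    """Classify a variant by minor allele frequency (MAF)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency lies between 0 and 0.5")
    if maf < 0.01:
        return "rare"
    if maf <= 0.05:
        return "low-frequency"
    return "common"


for maf in (0.002, 0.03, 0.25):
    print(maf, frequency_class(maf))
```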
As discussed above, past results make a strong case for common diseases being more similar to Mendelian diseases than is postulated by the common disease–common variant model. It seems possible that much of the genetic cause of common diseases is due to rare and generally deleterious variants that have a strong impact on the risk of disease in individual patients, and which can now be identified thanks to NGS. Because the most obvious disease-influencing variants will be the clearly functional ones, WES has the potential to identify these rare variants and allow definitive connections to be rapidly established between specific genes and many important common diseases. However, there are drawbacks to this technique, the most important being that it almost entirely misses structural variation. WES, despite its name, also misses a certain set of exons: if causal variants lie within these exons that are not targeted (we found that 5–10% of RefSeq exons or ∼3% of RefSeq coding exons have <5× coverage in the latest commercial capture kits), they will not be identified, as is also the case for monogenic disorders. Additionally, the capture methods currently used require the sequencing of a far greater number of bases than expected based on the size of the exome, which makes WES prices comparable to those of low-coverage WGS. However, low-coverage sequencing will miss many of the variants present in only a single individual. For these reasons, high-coverage WES is the method of choice for complex traits as it is becoming more affordable, and there is a rapid increase in the sequencing capacity of existing platforms as well as the development of new less-expensive platforms.
An essential point to improve the chances of success is to carefully choose cases and controls for each study as costs restrict sample size. Selecting cases with a strong family history will increase the probability of finding pathogenic variants with large effects. The availability of large control cohorts who can be recalled and phenotypically evaluated will also be crucial. Unlike variants with weak influences which could be expected to appear in the population at large without much phenotypic effect, a variant of strong effect is less likely to appear in an individual without any phenotypic consequence. Confirmation of such potentially causal variants will therefore often require the careful evaluation of the phenotype in any controls who are carriers. Several complex diseases are currently being investigated using NGS and include mental diseases, diabetes and autoimmune disorders such as lupus and inflammatory bowel diseases. The results from these studies have the potential to revolutionise the screening of these disorders and our therapeutic approach. The low-frequency and rare variants are likely to be population-specific at a much finer scale than common variants, so careful geographical and molecular matching of cases and controls is that much more important than in GWA studies. An alternative strategy is to do gene-based rather than variant-based analysis, which will require sequencing rather than genotyping in the replication cohorts. The pros and cons of these alternatives are still being debated.
WES in characterising cancer
The application of NGS technologies is allowing substantial advances in cancer genomics. Indeed, the development of massively parallel sequencing technologies makes it feasible to catalogue all classes of somatically-acquired mutations in a cancer.63–65 It has become feasible to sequence all expressed genes (the transcriptomes),66 67 exomes and, more recently, complete genomes64 68–70 of cancer samples. WES with capillary sequencing allowed the analysis of all known coding genes in colorectal, breast and pancreatic carcinomas and glioblastoma.71–73 These studies have led to the discovery of somatic mutations in isocitrate dehydrogenase 1 in glioblastoma72 and of germline mutations in PALB2 (the gene encoding partner and localiser of BRCA2) in patients with pancreatic carcinoma,74 among other important findings. In addition, the hybrid selection approach will be particularly powerful for diagnostic analysis of the cancer genome; for diagnosis, there may be value in sequencing specific oncogenes and/or tumour suppressor genes at very high coverage in samples with a low percentage of tumour cells.75 However, a major challenge of cancer genome analysis is to identify ‘driver’ mutations,65 and several recent genome studies of leukaemias, myelomas and solid tumours including breast, lung and pancreatic cancer have concentrated their analysis on coding regions (exomes) to increase the likelihood of identifying driver mutations76–79 or used integrative genomics approaches (mapping of structural variation, whole genome methylation and gene expression analysis) in association with NGS techniques.72 Whole exome analysis on these distinct subtypes provides a better understanding of mechanisms underlying specific cancers and also identifies new biomarkers and/or drug targets, as recently reported, for example, in individuals with acute myeloid leukaemia.68 80 81
WES is thus opening new avenues towards understanding the molecular pathogenesis of cancers. For example, the discovery of DNMT3A (a gene involved in DNA methylation) mutations in acute myeloid leukaemia (AML) may imply that aberrant epigenetic regulation is critical for pathogenesis, but the exact link—whether it be altered gene expression or genome instability—has yet to be uncovered. Other key questions include what aspects of leukaemia biology can be attributed to mutations in this gene and why it is concentrated in specific AML subtypes and associated with a poor prognosis. Thus, additional genetic and/or epigenetic events can modulate this disease type and must be uncovered through further exploration of the genome and the epigenome. WGS is also being used to investigate cancers, with an initial focus on very specific subgroups in certain cancers to minimise the confounding effect of genetic heterogeneity.70
Other innovative approaches are making use of NGS in targeted exome/genome sequencing for the design of cost-effective targeted sequencing methods to the benefit of personalised chemotherapy.82 In another recently published study, targeted NGS detected point mutations, insertions, deletions and balanced chromosomal rearrangements and identified novel leukaemia-specific fusion genes in a single procedure combining 454 shotgun pyrosequencing with long oligonucleotide sequence capture arrays.83 In yet another study, NGS was applied as a screening method to characterise a number of known genetic alterations in chronic myelomonocytic leukaemia and identified that a pattern of molecular mutations translated into distinct biological and prognostic categories.84
There are unique methodological considerations in NGS analyses of cancer samples (reviewed by Meyerson et al85). Cancer samples and cancer genomes have general characteristics that are distinct from other tissue samples and from genomic sequences that are inherited through the germ line. Cancers themselves may be highly heterogeneous and composed of different clones that have different genomes.86 Cancer genomes are enormously diverse and complex and have major structural variability. They vary substantially in their sequence and structure compared with normal genomes and among themselves. To identify somatic alterations in cancer, comparison with matched normal DNA from the same individual is essential. This is largely because of our incomplete knowledge of the variations in the normal human genome; to date, each ‘matched normal’ cancer genome sequence has identified large numbers of mutations and rearrangements in the germ line that had not previously been described.
Relevance for the clinical use of WES
Gene discovery is an essential starting point for both understanding the genetic mechanisms underlying diseases and for providing clues to therapeutic approaches. Gene-specific treatments are currently ongoing worldwide, and several successful gene therapy trials aim to correct inborn errors for diseases such as immune deficiencies, metabolic disorders and, more recently, thalassaemia.87–93 Local delivery of the replacement gene is also being tested in human clinical trials for several forms of hereditary blindness such as Leber congenital amaurosis and retinitis pigmentosa.94 Also, genetic testing for common mutations in recessive disorders such as Tay-Sachs disease has proved to be of benefit both for diagnosis and carrier detection.5 For complex traits, understanding the genetic alterations in disease variability and resistance to treatment in a given individual could revolutionise care and may soon make the concept of personalised medicine a reality. WES is paving the way to identifying driver mutations in cancer as well as the genetic events leading to metastasis, the primary cause of cancer mortality, and which are potentially amenable to therapeutic targeting. These insights will provide improved means to prevent recurrence and to avoid therapeutic resistance. WES can also complement histopathological analysis by allowing for more accurate diagnosis and improved subgrouping of patients.95 The clinical applications are thus enormous. With regard to personal genomics, numerous companies already use SNP arrays to offer predictions of common disease risk directly to consumers, which can influence lifestyle choices and decisions to use relatively non-invasive monitoring programmes (eg, imaging). Genome sequencing will greatly improve the specificity of such predictions and adds the ability to detect novel variants, and might lead to an expansion of fetal screening.
Why not use WGS?
WGS is being increasingly used based on its availability and improved cost efficiency. Indeed, in the future, WGS is predicted to be more economical than WES because the capture process is skipped entirely. This technique has the advantage of capturing all of the exome (as some can be missed by the exome capture process), and can provide information on variants in highly evolutionarily conserved non-coding regions and other variants throughout the genome. In addition, WGS using a paired-end approach can be used to detect large structural variants such as large insertions or deletions, inversions and translocations. As the cost of WGS continues to decrease, it will become increasingly popular because of its ability to survey most of the genome as well as additional classes of mutations. However, the amount of data generated from WGS is 100 times more than the already overwhelming amount obtained by WES. The bioinformatics filtering techniques, storage facilities, software and hardware for data analysis will prove a challenge, and most ongoing projects initially focus on the exome for the first analysis. It should be noted that WGS is not immune to some of the drawbacks of exome sequencing. There is significant variability in sequencing efficiency across the genome and the fluctuations in coverage will result in many regions of interest being missed. Also, repetitive regions (exonic and others) are difficult to align in either case and can result in missed variants or an excess of variant calls. These problems may be resolved in the future with more uniform library construction, higher sequencing depth, longer reads and larger paired-end fragment sizes. However, in the foreseeable future we are likely to continue facing some gaps in genome coverage. Unless an analysis focuses specifically on non-coding regions or on structural variation, WES provides most of the benefits of WGS but with lower costs, both for sequencing and for storage and analysis of the data.
Some practical considerations
High-throughput genomic data analysis has traditionally been the domain of the bioinformatician and statistician. Laboratory researchers have gradually been embracing genomic technologies such as microarrays and applying them as ‘hypothesis-free’ discovery tools to be followed up by focused experimentation. This type of experimentation was usually accompanied by collaboration with skilled data analysts or the development of custom analysis software. NGS data will undoubtedly become commonplace in the near future. What can a researcher or a clinician embarking on a WES or WGS adventure expect to obtain from the sequencing service provider?
A number of excellent targeted sequence capture kits are commercially available, including kits from Agilent (Santa Clara, CA, USA), Illumina (San Diego, CA, USA) and Nimblegen (Madison, WI, USA). Once captured, the DNA fragments need to be sequenced. Currently, ABI (Carlsbad, CA, USA) and Illumina are the two major companies in the sequencing field, but a number of third-generation sequencers are being developed and may enter the field in the near future (table 2). After the sequencing process, individual sequence reads are typically aligned to the reference genome sequence. Next, positions at which the sample of interest differs from the reference genome are identified. At this point a number of bioinformatic filtering steps are required to separate common benign polymorphisms from potentially deleterious mutations.
We believe that, while the choice of the optimal technology and analytical pipelines is important, it is secondary to the service provider's experience with the specific technology and willingness to engage in some back-and-forth dialogue with the researcher on custom analysis needs. As an example, at the McGill University and Genome Quebec Innovation Centre, we have to date processed the data from over 300 exomes. This experience is instrumental in identifying systematic false positive and false negative results. False positives most often arise from incorrect mapping and from systematic sequencing errors, for example, certain words (combinations of nucleotides) being systematically misread by the sequencer. Both types of error can be removed by comparing each test sample against previously sequenced exomes. Because systematic errors recur from sample to sample, any variant present in more than a certain proportion of all sequenced samples can easily be removed from the final list of variants. False negative results can arise from low overall coverage, poor capture efficiency of certain regions and difficulty in unambiguously aligning repetitive regions. Such missing regions can easily be flagged and reported to the researcher, who may want to follow them up by targeted sequencing.
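The comparison against previously sequenced exomes can be sketched in a few lines. This is an illustration only and not the pipeline used at any particular centre: the function names, the representation of a variant as a (chromosome, position, reference allele, alternate allele) tuple and the default cut-off of 5% (borrowed from the figure quoted below for ‘interesting’ variants) are all assumptions.

```python
from collections import Counter


def recurrent_variants(cohort_calls, max_fraction=0.05):
    """Return the variants seen in more than max_fraction of earlier exomes.

    cohort_calls is a list with one entry per previously sequenced exome,
    each entry being a collection of (chrom, pos, ref, alt) tuples.
    The tuple layout and the 5% default are illustrative assumptions.
    """
    n_exomes = len(cohort_calls)
    counts = Counter()
    for calls in cohort_calls:
        counts.update(set(calls))  # count each variant at most once per exome
    return {v for v, c in counts.items() if c / n_exomes > max_fraction}


def filter_sample(sample_calls, cohort_calls, max_fraction=0.05):
    """Drop calls from the test sample that recur across the cohort."""
    recurrent = recurrent_variants(cohort_calls, max_fraction)
    return [v for v in sample_calls if v not in recurrent]
```

A filter of this kind cannot tell a systematic artefact from a genuinely common polymorphism. Both are removed, which is usually the desired behaviour when searching for rare causative variants.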
The final output for each sample is a list of variants that can be easily manipulated in a spreadsheet. In our experience, each sample produces roughly 500 potentially ‘interesting’ protein-coding variants, that is, those that have not been seen in more than 5% of other exomes. Our annotated output file contains basic information on the chromosomal position, nucleotide change, predicted protein change, gene name and gene description. Further annotation includes the Online Mendelian Inheritance in Man (OMIM) entry (if available), the Sorting Intolerant From Tolerant (SIFT)96 prediction of how likely the change is to be damaging, interspecies conservation of each residue, dbSNP entry and allele frequency from the 1000 Genomes Project.97 We also provide information on the sequencing quality of each variant and clickable links to the primary sequencing data visualised in the Integrative Genomics Viewer98 and the graphic display of each position in the UCSC Genome Browser.99 The end user who is interested in a recessive disease and studies a consanguineous family, for example, can then apply simple spreadsheet filtering functions to display only homozygous changes that have never been seen before or that have very low minor allele frequencies. In most cases this will reduce the final list to a dozen or fewer candidate variants that can then be followed up manually.
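For readers who prefer a script to spreadsheet filters, the recessive-disease filter described above might look like the following sketch. The column names (`gene`, `protein_change`, `zygosity`, `maf_1000g`), the example rows and the 1% frequency cut-off are hypothetical and do not reflect the layout of any actual annotated output file. An empty frequency field is taken here to mean that the variant has never been observed.

```python
import csv
import io


def candidate_recessive_variants(rows, max_maf=0.01):
    """Keep homozygous variants that are novel or have a low allele frequency.

    rows is an iterable of dicts such as those produced by csv.DictReader.
    Column names and the 1% cut-off are illustrative assumptions.
    """
    kept = []
    for row in rows:
        if row["zygosity"] != "hom":
            continue  # discard heterozygous calls
        maf_text = row["maf_1000g"].strip()
        maf = float(maf_text) if maf_text else 0.0  # empty field: never seen
        if maf <= max_maf:
            kept.append(row)
    return kept


# Invented example rows standing in for an annotated variant list.
EXAMPLE = """gene,protein_change,zygosity,maf_1000g
GENE_A,p.Arg100Trp,hom,
GENE_B,p.Gly20Ser,het,
GENE_C,p.Ala5Val,hom,0.23
GENE_D,p.Leu77Pro,hom,0.002
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(EXAMPLE)))
    for row in candidate_recessive_variants(rows):
        print(row["gene"], row["protein_change"])
```

For a real file, the `io.StringIO(EXAMPLE)` argument would be replaced by an open file handle. The rest of the logic is unchanged.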
Of course, as the number of samples and complexity of a disorder increases, at some point a switch from a spreadsheet to a dedicated bioinformatician may be necessary. However, for Mendelian disorders a spreadsheet-savvy researcher should be quite successful in analysing exome sequencing results.
Some final rules of thumb: choose a friendly, experienced sequencing centre; longer reads are better than shorter reads as they reduce false positives from mapping ambiguity; paired-end reads are better than single-end reads for the same reason; and 30× median coverage of the target may be sufficient, but 100× coverage is much safer as it ensures that variants can be confidently determined across a higher proportion of the exome.
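The coverage rule of thumb is easier to appreciate with a small calculation. The sketch below summarises a list of per-base read depths across the capture target by its median and by the fraction of bases that reach a minimum depth. The function name, the minimum depth of 10 reads and the toy depth values are assumptions made for illustration and are not thresholds recommended in this article.

```python
from statistics import median


def coverage_summary(depths, min_depth=10):
    """Summarise per-base read depth across the capture target.

    depths is a sequence of integer read depths, one per target base.
    min_depth is a hypothetical threshold below which a base is treated
    as too poorly covered for confident variant calling.
    """
    if not depths:
        raise ValueError("no target bases supplied")
    callable_bases = sum(1 for d in depths if d >= min_depth)
    return {
        "median_depth": median(depths),
        "fraction_callable": callable_bases / len(depths),
    }


# Toy data: the median is about 30x, yet two of ten bases fall below 10 reads.
toy_depths = [0, 4, 12, 30, 30, 31, 45, 60, 80, 120]
summary = coverage_summary(toy_depths)
```

In this toy example a median depth close to 30× still leaves a fifth of the target below the chosen threshold. This is the reason a higher median coverage gives confident calls across a larger proportion of the exome.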
Ethical issues raised by WES
The increased ability to share large amounts of individual-specific genetic information across borders puts a new twist on perennial ethical issues such as consent, feedback, protection of privacy and the governance of research.100–103 Informed consent is needed from participants in research and has been a guiding principle of medical investigation since the mid-20th century. This conception of consent, along with the concomitant power to withdraw from research without prejudice, arose originally in the context of biomedical research. It had the aim of protecting participants from abuse and from potential physical harm, and focused on clinical interventions and the collection of samples rather than on data collection per se. From its inception, informed consent was strongly concerned with the protection of individuals.
Genomics research, however, moves away from these origins on several counts. The information that is derived from DNA is a powerful personal identifier and can provide information, not just on the individual but also on the individual's relatives and ethnic groups, in a format that is easy to share across international borders. Although samples and data have personal identifiers removed, individuals may still be re-identifiable because of the richness of the data derived from the analysis. The data produced are often shared informally among researchers, but more formal mechanisms have been put in place by funders to ensure the rapid sharing of NGS data, such as the requirement to deposit data sets in open access archives.104 Examples are the European Genotype Archive and dbGaP (NIH-USA). The complexity of genomics research, together with the difficulty of providing precise specifications for future use of data, has prompted serious concerns about whether any consent to such research can be adequately ‘informed’. There is a pressing need to learn from insights gained elsewhere, such as in genetic counselling and in family studies. Likewise, calls to involve the community in consent raise ethical issues about individual and group rights, which may differ between communities across the globe.
Reporting findings back to the participants may be considered to be an important part of building and maintaining public trust in research.102 105–107 Providing participants with information about the general findings of research, such as publications based on the research, is an uncontroversial and welcome practice. In contrast, informing a single individual of his or her results remains controversial in many areas of research and particularly in the area of WGS. There appears to be some agreement that, where there is a serious treatable condition, researchers have a moral obligation to feed this information back to research participants.108 In cases where findings are of a less serious nature, are untreatable or of uncertain significance, the potential benefits for participants of being informed need to be balanced against the participant's right not to know. The thoughtful handling of such issues is of clear relevance to the maintenance of public trust in the research process and is the subject of ongoing studies in collaboration with some of the initiatives exploring WES/WGS in rare diseases and cancer in North America. Not surprisingly, an ethical investigation component has been added to each of these initiatives to try to assess the impact of the findings on families and to determine appropriate ways to return information and protect privacy.
Conclusions and future directions
Vast amounts of clinical, biological and sequencing data are now being generated by an expanding number of research efforts on a scale that could only be imagined a few years ago. Interpreting these data and translating the findings to improve healthcare is a challenge in itself. In addition to developing locus-specific databases and large data warehouses for NGS datasets, there is a major need to create dedicated databases to enhance the clinical interpretation process. Indeed, the development of analysis techniques to cope with the millions of variants called per genome will be a high priority, as will the development of techniques that can combine data about different rare variants into one analysis. The advances we have described highlight the important implications that particular mutations, discovered through NGS and WES, can have for medical management and for tailoring therapy to the genetic background of a given individual in a vast array of diseases. Furthermore, definitive connections (for example, a clearly functional mutation in a single gene conferring a strongly increased risk of a disease) would provide validated therapeutic targets for the pharmaceutical industry, and genetic discovery could be the most likely avenue for ameliorating the ongoing crisis in global drug development. This prediction assumes that rare variants will be found that have large influences on rare and common diseases, that their biological functions will be obvious and that locus and allelic heterogeneity will not prevent insights into the mechanisms of disease. How often these assumptions will hold is currently unknown and will largely determine the rate of discovery in the coming years.
References
Footnotes
Funding This work was supported by the McGill University Health Center Research Institute. NJ is the recipient of a Chercheur Boursier award from Fonds de Recherche en Santé du Quebec. JM is a recipient of a Canada Research Chair. EL is funded by the Canadian Institute for Health Research.
Competing interests None.
Patient consent Obtained.
Provenance and peer review Not commissioned; internally peer reviewed.