Abstract
Background Intellectual disability is a highly complex condition for which more than 600 causative genes have been reported. Owing to this extraordinary heterogeneity, a large proportion of patients remain without a specific diagnosis or genetic counselling. New methodological strategies that can detect a greater number of mutations across multiple genes are therefore crucial.
Methods In this work, we screened a large panel of 1256 genes (646 pathogenic, 610 candidate) by next-generation sequencing to determine the molecular aetiology of syndromic intellectual disability. A total of 92 patients, negative for previous genetic analyses, were studied together with their parents. Clinically relevant variants were validated by conventional sequencing.
Results A definitive diagnosis was achieved in 29 families by testing the 646 known pathogenic genes. Mutations were found in 25 different genes, of which only KMT2D, KMT2A and MED13L were mutated in more than one patient. A preponderance of de novo mutations was noted, even among the X linked conditions. Additionally, seven de novo probably pathogenic mutations were found in the candidate genes AGO1, JARID2, SIN3B, FBXO11, MAP3K7, HDAC2 and SMARCC2. Altogether, this represents a diagnostic yield of 39% of the cases (95% CI 30% to 49%).
Conclusions The developed panel proved to be efficient and suitable for the genetic diagnosis of syndromic intellectual disability in a clinical setting. Next-generation sequencing has the potential for high-throughput identification of genetic variations, although the challenges of an adequate clinical interpretation of these variants and the knowledge on further unknown genes causing intellectual disability remain to be solved.
- Intellectual disability
- de novo mutation
- sequence analysis
- genetic diseases
Introduction
Intellectual disability (ID) is one of the most common disorders, affecting around 1 in 50 individuals, and can be caused by a variety of environmental and genetic factors.1,2 This is not unexpected, given that the central nervous system is probably the most complex organ in humans: an intricate cell-based network whose development and functioning, not yet fully understood, must be controlled by a very large number of genes. Thus, it is not surprising that the number of genes causing ID when mutated is still unknown. This genetic heterogeneity, together with the fact that none of the genes associated with ID shows a high prevalence, complicates the task of determining the molecular aetiology of ID.3,4 Because of this extreme heterogeneity, genetic diagnosis based on Sanger sequencing, a rather expensive and time-consuming technique, is of limited utility in terms of diagnostic yield. Consequently, new methodological strategies that allow the efficient study of multiple genes and the detection of a greater number of mutations are imperative.
In recent years, with the emergence of high-throughput technologies such as array comparative genomic hybridization (CGH), there have been significant advances in the knowledge of the genetic causes of ID, which have assisted in disease diagnosis and other clinical practices.5,6 More recently, massively parallel sequencing approaches, also known as next-generation sequencing (NGS), have made it possible to perform multiple simultaneous analyses from small sample volumes in a cost-effective manner.7
In order to address the genetic (and phenotypic) heterogeneity of ID and to assess the diagnostic yield of this new technology in a clinical setting, we developed a panel of 1256 genes for the diagnosis of syndromic ID. Here, we present the findings after applying this panel to 92 families from our cohort of idiopathic syndromic ID.
Methods
Patient recruitment
The study was carried out on patients from our cohort of syndromic ID of unknown cause, previously reported.8 The patients were recruited for genetic investigation of unexplained ID associated with multiple congenital anomalies, dysmorphic features and/or a positive family history of ID, congenital anomalies or miscarriages. A clinical and epidemiological description of this cohort can be seen in table 1.
Table 1. Description of the cohort of 92 patients with syndromic intellectual disability (ID)
The study was approved by the local ethics committee of the Hospital Universitario y Politecnico La Fe (Valencia, Spain). Informed consent was obtained from the parents of all the participants.
Sample preparation
Genomic DNA was extracted from peripheral blood by standard methods. The purity and concentration of the samples were checked using a NanoDrop 8000 spectrophotometer (Thermo Scientific) and a Qubit 2.0 fluorometer (Invitrogen), respectively.
Gene selection and panel design
A systematic search for genes implicated in ID when mutated was performed in different databases, mainly OMIM (http://www.ncbi.nlm.nih.gov/omim/), PubMed (http://www.ncbi.nlm.nih.gov/pubmed/) and Gene (http://www.ncbi.nlm.nih.gov/gene/). Additionally, candidate genes were selected based on their functional or predicted relationships with known pathogenic genes and/or because they are located in critical regions of microdeletion syndromes that lack a known dosage-sensitive gene (positional candidates). These positional candidates were based both on our own results of array-CGH screening in the cohort of patients mentioned above and on candidate genes reported elsewhere. A total of 646 pathogenic genes known to cause ID and 610 candidate genes were included in the design (see online supplementary tables S1 and S2). In addition, all the enhancer elements located inside or in close proximity to these genes (http://genome.lbl.gov/vista) and the ultraconserved elements of the human genome were included in the design. Genomic coordinates of both kinds of elements were obtained from the University of California, Santa Cruz (UCSC) (http://genome.ucsc.edu) based on human genome build GRCh37/hg19.
Capture, NGS and analysis pipeline
A custom SureSelect oligonucleotide probe library was designed to capture 19 878 coding exons or presumed regulatory elements of the 1256 genes. The design includes all the transcripts reported for each target gene in different databases (RefSeq, Ensembl, Consensus Coding Sequence (CCDS), Gencode, Vertebrate Genome Annotation (VEGA)). The SureSelect DNA Standard Design Wizard (Agilent Technologies, Santa Clara, California, USA) was used for probe design with a 2× tiling density and moderately stringent masking. A total of 71 994 probes, covering 5.073 Mb (99.48% coverage of targets), were synthesised by Agilent Technologies. Sequence capture, enrichment and elution were performed according to the manufacturer's instructions. The libraries were sequenced on an Illumina HiSeq 2000 platform with a paired-end run of 2×90 bp, following the manufacturer's protocol to generate at least a 100× effective mean depth.
Variant calling was performed on the DNAnexus platform (DNAnexus, Mountain View, California, USA) through the following pipeline: FASTQ paired reads were aligned to the reference human genome UCSC hg19 using the BWA-MEM (Burrows-Wheeler Aligner with Maximal Exact Matches) algorithm from the BWA software package. Mappings were deduplicated using Picard, realigned around sites of known indels, and their quality was recalibrated by looking at covariance in quality metrics with frequently observed variation in the genome. After recalibration, variants were called with the GATK UnifiedGenotyper module. Variants in regions with low read depth (≤10) or with a low allelic ratio (≤0.32) were filtered out. Annotation of nucleotide variants was performed with the Ion Reporter Software (Life Technologies, Carlsbad, California, USA).
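The final quality filter in this pipeline can be illustrated with a minimal sketch. It assumes that read depth is the sum of reference and alternate reads at the site and that the allelic ratio is the fraction of reads supporting the alternate allele; neither definition is spelled out in the text, and the `VariantCall` record and function names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds taken from the text: calls at or below these values are discarded.
DEPTH_CUTOFF = 10
ALLELIC_RATIO_CUTOFF = 0.32


@dataclass
class VariantCall:
    """Hypothetical minimal record for one called variant."""
    chrom: str
    pos: int
    ref_reads: int
    alt_reads: int

    @property
    def depth(self) -> int:
        # Assumption: depth = reads supporting either allele.
        return self.ref_reads + self.alt_reads

    @property
    def allelic_ratio(self) -> float:
        # Assumption: fraction of reads carrying the alternate allele.
        return self.alt_reads / self.depth if self.depth else 0.0


def passes_quality_filter(call: VariantCall) -> bool:
    """Keep a call only if depth > 10 and allelic ratio > 0.32."""
    return call.depth > DEPTH_CUTOFF and call.allelic_ratio > ALLELIC_RATIO_CUTOFF


def filter_calls(calls: list[VariantCall]) -> list[VariantCall]:
    """Return the calls that survive both thresholds."""
    return [call for call in calls if passes_quality_filter(call)]
```

A cutoff of 0.32 sits well below the 0.5 expected for a constitutive heterozygous variant, so calls dominated by sequencing noise are removed while some allelic imbalance is tolerated.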
Assessment of the pathogenicity of variants
To evaluate the putative clinical impact of the variants, the following criteria were applied: (1) an allele frequency <0.01 in the 1000 Genomes or Exome Variant Server (EVS) databases; (2) the stop gain, frameshift and splicing variants were a priori considered to be the most likely to be pathogenic; (3) for missense mutations, amino acid conservation and predictions of pathogenicity (sorting intolerant from tolerant (SIFT), PolyPhen2 and Grantham) were evaluated; (4) a de novo occurrence (dominant inheritance), the presence of two mutant alleles in the same gene, each from a different parent (recessive inheritance), or maternal inheritance of X linked variants; (5) the absence of the variant in other samples (in-house database); (6) phenotypic consistency with the clinical signs associated with mutations in the same gene (only for known ID genes). To evaluate the possible effect of synonymous or intronic variants on gene splicing, we used the Human Splicing Finder web tool.
In addition, the Combined Annotation-Dependent Depletion (CADD) score for prediction of pathogenicity (C-score),9 as well as the gene-level constraint statistics obtained from the Exome Aggregation Consortium dataset (http://exac.broadinstitute.org), were employed as complementary analyses for de novo variants in the candidate genes, in order to find further support for their pathogenicity.
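Criteria (1) to (3) amount to a first triage step, which can be sketched as follows. This is only an illustration: the consequence labels and return values are hypothetical, and requiring a frequency below 0.01 in both databases (treating an unreported variant as rare) is an assumption about how the two sources were combined.

```python
from typing import Optional

MAX_ALLELE_FREQUENCY = 0.01  # threshold from criterion (1)

# Hypothetical consequence labels for the variant types named in criterion (2).
LOSS_OF_FUNCTION = {"stop_gained", "frameshift", "splice_site"}


def is_rare(af_1000g: Optional[float], af_evs: Optional[float]) -> bool:
    """Criterion (1): frequency below 0.01 in both databases.

    None means the variant is not reported in that database.
    """
    return all(
        af is None or af < MAX_ALLELE_FREQUENCY
        for af in (af_1000g, af_evs)
    )


def triage(consequence: str,
           af_1000g: Optional[float],
           af_evs: Optional[float]) -> str:
    """Assign a variant to the next step of the assessment."""
    if not is_rare(af_1000g, af_evs):
        return "discard"
    if consequence in LOSS_OF_FUNCTION:
        # Criterion (2): considered most likely pathogenic a priori.
        return "likely_pathogenic_a_priori"
    if consequence == "missense":
        # Criterion (3): conservation, SIFT, PolyPhen2 and Grantham are reviewed.
        return "needs_missense_predictions"
    # Synonymous or intronic variants go to splicing prediction.
    return "needs_splicing_review"
```

Criteria (4) to (6) depend on parental genotypes, the in-house database and the clinical picture, so they are not captured by this per-variant function.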
Validation by Sanger sequencing
Relevant genetic variants were confirmed by Sanger sequencing. After PCR amplification from DNA of the patient and his/her parents using specific primers, bidirectional sequencing was performed using the BigDye Terminator kit V.1.1 and an ABI PRISM 3500 automated sequencer (Life Technologies). All primers for amplification and sequencing were selected with ExonPrimer (primers and PCR conditions are available on request).
Results
About 99.8% of the target regions were covered at ≥10× read depth and 99.1% at ≥40×, with an average read depth per base of 133×. On average, 3184 variants in known and candidate genes were called per sample. After excluding variants with an allele frequency >0.01, an average of 57 variants per sample that were coding non-synonymous or had a high conservation score (phyloP>2) were selected for further analyses. Variants were then filtered according to their inheritance (de novo, compound heterozygous with one variant from each parent, or X linked hemizygous), as well as their putative functional effect, presence in the Exome Aggregation Consortium dataset (ExAC: http://exac.broadinstitute.org) or in mutation databases (Human Gene Mutation Database (HGMD): http://www.hgmd.cf.ac.uk; ClinVar: http://www.ncbi.nlm.nih.gov/clinvar/), expected clinical consequences, etc.
We detected 30 different pathogenic variants in 29 patients among the 646 known genes causing ID (see table 2). All these variants were confirmed by Sanger sequencing in the patients and parents, as well as in further family members when necessary. About half (52%) of these causal variants were deleterious loss-of-function mutations: nine frameshift and seven nonsense mutations. In addition, 13 missense changes were predicted to be disease causing by the functional prediction algorithms (PolyPhen2, SIFT and Grantham scores). Finally, one in-frame deletion affected a highly conserved residue in the NIPBL gene and was predicted to be disease causing by MutationTaster.
Table 2. Pathogenic mutations detected in our cohort and the corresponding diagnosis (Syndrome)
In order to assign pathogenicity, the major factor we considered was that the phenotypes of these patients were highly concordant with the syndromes associated with other mutations in the corresponding genes. The main clinical signs present in these patients had previously been reported for the corresponding syndrome (see online supplementary table S4). The high clinical correlation of the phenotype, together with the segregation analyses, makes us fully confident of their pathogenicity. The information concerning these mutations, including the clinical characteristics of the patients, has been deposited in the Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources (DECIPHER).10
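The inheritance-based filtering of trio data can be sketched as below. Genotypes are encoded as the number of alternate alleles (0, 1 or 2) in the child, mother and father; this encoding, the `TrioVariant` record and the label names are assumptions made for illustration, and edge cases such as pseudoautosomal regions or low-level parental mosaicism are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrioVariant:
    """Hypothetical trio genotype: alternate allele counts per individual."""
    gene: str
    chrom: str
    child: int
    mother: int
    father: int


def classify(v: TrioVariant, child_is_male: bool) -> str:
    """Label one variant by its apparent mode of transmission."""
    if v.child == 0:
        return "absent"
    if v.mother == 0 and v.father == 0:
        return "de_novo"
    if v.chrom == "X" and child_is_male and v.mother >= 1 and v.father == 0:
        return "x_linked_maternal"
    if v.child == 2 and v.mother >= 1 and v.father >= 1:
        return "homozygous_recessive"
    if v.child == 1 and v.mother >= 1 and v.father == 0:
        return "maternal_het"
    if v.child == 1 and v.father >= 1 and v.mother == 0:
        return "paternal_het"
    return "other"


def compound_het_genes(variants: list[TrioVariant],
                       child_is_male: bool) -> list[str]:
    """Genes carrying at least one maternal and one paternal heterozygous variant."""
    origins: dict[str, set[str]] = {}
    for v in variants:
        label = classify(v, child_is_male)
        if label in ("maternal_het", "paternal_het"):
            origins.setdefault(v.gene, set()).add(label)
    return sorted(gene for gene, seen in origins.items() if len(seen) == 2)
```

Compound heterozygosity is evaluated per gene rather than per variant, which is why it needs a second pass over the variants after each one has been labelled.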
Significantly, most of these mutations occurred de novo in the patients (24/30). In addition, the c.3369_3372delTGAG variant, detected in the ANKRD11 gene in patient 25, was also present at a low frequency in the paternal DNA, which was confirmed by Sanger sequencing. Therefore, the father presents somatic (and germline) mosaicism for this mutation, with mild manifestations of KBG syndrome: borderline intelligence, hearing loss and facial features similar to those present in the patient. Two further patients (7 and 9) inherited from their mothers an X linked hemizygous mutation in the genes SLC16A2 and NSDHL, respectively. Family studies in these two cases confirmed co-segregation of the pathogenic variants with the disease.11 Finally, recessive inheritance could only be detected in two instances: patient 20 presents the homozygous mutation p.Gly611Ser in the NDST1 gene, inherited from healthy non-consanguineous parents who are both heterozygous carriers, whereas patient 27 presents compound heterozygous mutations in the PIGN gene (c.94_94delA and p.Pro579Ser), each inherited from a different parent.
Germline mosaicism could be demonstrated for the variant p.Gly870Asp in the UBE3A gene (patient 8). This variant was also present in his two affected half-brothers (from the same mother) but was not detected in the genomic DNA sample of the mother or of a healthy full brother. Segregation analysis with linked microsatellite markers confirmed that the three affected half-brothers share the same maternal haplotype in 15q12, which was not inherited by the healthy non-carrier brother. The only obvious explanation is that this variant is present as germline mosaicism in the mother.
False paternity and maternity were ruled out in every case because most variants detected in each patient were also present in one or the other parental sample, in spite of the low or very low allele frequency of many of these variants. On average, each patient showed 189 genetic variants not reported in the 1000 Genomes or EVS databases.
A more complex situation was encountered when dealing with variants with a predicted pathogenic effect in candidate genes, that is, those genes not previously reported to cause syndromic ID. For these cases, in addition to the previous criteria, a thorough review of the literature and different databases was performed looking for: (1) absence or scarcity of polymorphic deletions (database of genomic variants; http://dgv.tcag.ca/); (2) evolutionary constraint, such as absence or a significant lack of loss-of-function mutations in the Exome Aggregation Consortium dataset; (3) a pattern of expression and functional role of the protein concordant with the clinical signs; (4) known phenotypic consequences in animal models. After an exhaustive analysis, we found seven de novo variants probably causing syndromic ID in the genes AGO1, JARID2, SIN3B, FBXO11, MAP3K7, HDAC2 and SMARCC2 (see table 3). Given that the clinical consequences of constitutive mutations of these genes in humans are currently unknown, these variants were classified as ‘probably pathogenic’.
Table 3. Probably pathogenic de novo mutations detected in our cohort among the novel candidate genes
Other possibly pathogenic variants were not included in this list for one of three reasons: because another pathogenic mutation fully explains the phenotype, such as the de novo missense mutation in the CACNA1H gene in patient 12, who also presents a frameshift mutation of the DYRK1A gene; because the patient's clinical features do not correspond to those previously associated with mutations in the gene (such as a stop mutation in the BCOR gene in patient 42); or because the genetic variant does not co-segregate with the disease, such as the in-frame deletion p.His838del in the X linked gene SHROOM4, also inherited by a healthy male (patient 43). See online supplementary table S3 for a selection of these variants of unknown significance, which include seven de novo variants and three X linked maternally inherited variants.
Discussion
We have achieved a minimal diagnostic yield of 32% (29/92; 95% CI 24% to 43%), which increases to 39% (36/92) when the probably pathogenic variants in novel candidate genes are included. A similar study by Redin et al12 of targeted sequencing of 217 genes in a cohort of 106 patients with ID led to a conclusive diagnostic yield of 25% (95% CI 16% to 32%). Their yield is slightly lower than in the present work, probably because fewer genes were studied. Another study, performed by targeted NGS of 253 ID-associated genes in almost 1000 probands with moderate-to-severe ID, provided a diagnosis in about 11% of the individuals.13 However, that study used a proband-only approach, so a direct comparison of the diagnostic yield is rather problematic. Nevertheless, taking into account that they studied roughly half as many known ID genes as the present study and found loss-of-function mutations (the main diagnostic criterion applied in their study) in 8% of individuals, this frequency is proportionately comparable to our result of 17% (16/92) diagnostic loss-of-function mutations.
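The diagnostic yields and their confidence intervals can be recomputed from the counts with a short sketch. The text does not state which interval method was used; the Wilson score interval below is an assumption, so the bounds it returns may differ by a percentage point from the published ones.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, total: int,
                    confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


COHORT_SIZE = 92
for label, diagnosed in (("known genes", 29), ("known + candidate genes", 36)):
    lower, upper = wilson_interval(diagnosed, COHORT_SIZE)
    print(f"{label}: {diagnosed}/{COHORT_SIZE} = {diagnosed / COHORT_SIZE:.1%} "
          f"(95% CI {lower:.1%} to {upper:.1%})")
```

With a cohort of 92, the interval spans roughly ten percentage points on either side of the estimate, which is worth keeping in mind when comparing yields across studies of similar size.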
It is remarkable that other studies based on whole-exome sequencing (WES) or whole-genome sequencing (WGS) yielded similar results to the present work. In a study by WES on 100 patients, de Ligt et al14 achieved a diagnostic yield of 16%. This study was published 4 years ago; in the meantime, the number of genes causing ID has increased, so that nowadays this figure would probably be higher. Another study performed by WGS on 40 patients resulted in a definitive diagnosis for 42% of them.15 Therefore, from a diagnostic point of view, the study by targeted NGS of all or most of the genes that are known to be involved in ID would be equivalent to a whole-exome study. The interpretation of presumably pathogenic variants in genes of unknown clinical effect does not imply a diagnostic improvement, although it is certainly the way for the identification of new genes and should be prioritised in the context of clinical research.
By comparison, array-CGH analysis of our cohort had previously identified the genetic cause in 26% of the patients, excluding CNVs of incomplete penetrance or potentially pathogenic CNVs.8 Therefore, the combined use of array-CGH and NGS led to a total diagnostic yield of 50%–55% of syndromic ID in our cohort.
Another notable finding is the high genetic heterogeneity of our cohort, as 25 different genes were mutated in 29 patients. Only three genes were recurrently mutated, suggesting a higher prevalence than the remaining genes: KMT2D (patients 5, 23 and 28), KMT2A (patients 15 and 26) and MED13L (patients 16 and 17). On the other hand, some clinical diagnoses were recurrently found in our series because of their genetic heterogeneity, such as Cornelia de Lange syndrome in three patients, with de novo mutations in the genes NIPBL, SMC3 and RAD21, respectively, or Coffin-Siris syndrome (patients 22 and 29) due to mutations in the ARID1B and ARID1A genes, respectively. Conversely, some of the syndromes we diagnosed have rarely been reported so far, suggesting extremely rare conditions, such as Primrose syndrome, with only seven known mutations in the ZBTB20 gene,16 the recently reported ‘TAF1 syndrome’17 or NDST1 missense mutations causing autosomal recessive ID.18
The most outstanding result of this study is the high rate of de novo mutations, found in 24 patients and in 2 parents (in germline or somatic mosaicism). This preponderance of de novo mutations has been repeatedly reported in several NGS studies on ID.11–13,17,19 For instance, de Ligt et al14 found de novo variants in 53% of the patients, which accounted for more than 80% of the conclusive genetic diagnoses; by comparison, evidence of autosomal recessive disease was found in only one affected patient. Gilissen et al, by WGS of a prescreened cohort of 50 ID trios, likewise found that most of the definitively pathogenic causes (20/21) were de novo mutations, including single-nucleotide variants as well as previously unnoticed structural variants (CNVs).15
The high rate of de novo mutations was found even among the X linked conditions, given that five of the seven causal variants found on the X chromosome appeared de novo, in four male patients and one female patient. Except for several dominant X linked entities, such as Rett syndrome or Coffin-Lowry syndrome,20 this is a novel finding, as studies targeting X linked ID are usually performed on familial cases. In the present work, however, the selection of cases was predominantly based on the association of congenital anomalies with ID and most cases are sporadic.8
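One plausible reading of the combined figure is a two-tier calculation in which NGS is applied only to the patients left undiagnosed by array-CGH. The sketch below assumes that the 92 patients studied here are representative of that array-negative remainder; this is an assumption about how the range was derived, not something stated in the text.

```python
def combined_yield(first_tier: float, second_tier: float) -> float:
    """Overall yield when the second test is applied to first-tier negatives."""
    return first_tier + (1 - first_tier) * second_tier


ARRAY_CGH_YIELD = 0.26          # from the text
NGS_YIELD_KNOWN = 29 / 92       # known genes only
NGS_YIELD_ALL = 36 / 92         # known plus candidate genes

low = combined_yield(ARRAY_CGH_YIELD, NGS_YIELD_KNOWN)
high = combined_yield(ARRAY_CGH_YIELD, NGS_YIELD_ALL)
print(f"combined yield: {low:.1%} to {high:.1%}")
```

Under this reading, the lower bound corresponds to counting only definitive diagnoses in known genes and the upper bound to also counting the probably pathogenic candidate-gene variants.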
A further 14 de novo mutations were found: seven probably pathogenic mutations (see table 3) and seven variants of unknown significance or likely benign (see online supplementary table S3). Two patients (6 and 12) present two different de novo mutations in the investigated genes. Constitutive de novo mutations have been estimated to occur about 0.8–1.3 times per individual in the whole exome.21–23 As we have sequenced only a fraction of the whole exome, the finding of 38 de novo variants (24 de novo pathogenic mutations, 7 probably pathogenic and 7 of unknown significance) represents a significant excess. The coding sequence of the genes we screened (3.584 Mb) represents about 5.4% of the exome; consequently, 4–6.5 de novo mutations would be expected by chance in the whole series (92 patients×0.054×0.8–1.3). It is highly suggestive that this estimate fits well with the seven variants of unknown significance (VOUS)/probably benign de novo variants listed in online supplementary table S3, which further supports the probable pathogenicity of most or all of the de novo variants in the candidate genes listed in table 3. Even more suggestive is the fact that all these candidate genes are under significant evolutionary constraint, all these variants affect highly conserved positions and none of them has been reported so far in the different databases. Consequently, we propose that all (or most) of the candidate genes AGO1, JARID2, SIN3B, FBXO11, MAP3K7, HDAC2 and SMARCC2 cause syndromic ID, probably with an autosomal dominant pattern of inheritance. Further studies would be necessary to demonstrate their clinical relevance.
It is also worth noting that autosomal dominant conditions are clearly over-represented in our cohort, accounting for 65% of the pathogenic variants, even though only 24% (155/646) of the pathogenic genes included in this design show this type of inheritance, which represents a highly significant enrichment (p<0.00001; binomial test). Furthermore, this over-representation of autosomal dominant conditions in our series would be even higher if the probably pathogenic mutations in candidate genes were included. All of them arose de novo, and no other likely pathogenic variant was found in any of these cases, even in the same gene.
These observations show the reliability of NGS for establishing a molecular diagnosis in this genetically heterogeneous trait. The main advantage of NGS over current diagnostic methods is that a significantly higher rate of successful diagnosis can be obtained by screening both known and novel mutations in all ID genes simultaneously. Nevertheless, all existing exome sequencing kits have limitations. First, our knowledge of all protein-coding exons in the genome is still incomplete. Second, the efficiency of capture probes varies considerably, and some sequences fail to be targeted by capture probe design. Finally, there is also the issue of whether sequences other than exons, including deep intronic sequences, untranslated regions, miRNAs, promoters and other regulatory elements, should be targeted.
In conclusion, we have developed and tested a panel based on NGS technology for the genetic diagnosis of syndromic ID, a very heterogeneous condition. The panel proved to be efficient, with a diagnostic yield (32%–39%) equivalent to that of WES or even WGS. These results demonstrate that the developed panel is suitable and helpful for the diagnosis of ID. However, the finding of probably pathogenic variants in novel candidate genes confirms that further research should be performed in order to identify all the pathogenic genes related to neurodevelopmental disorders.
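The expected number of chance de novo variants follows directly from the figures in the text, as the sketch below shows. The Poisson tail probability added at the end is not part of the original analysis; it is an assumed model, included only to illustrate how extreme 38 observed variants would be if all of them were background events.

```python
from math import exp, lgamma, log

PATIENTS = 92
EXOME_FRACTION = 0.054            # share of the exome covered by the panel
RATE_PER_EXOME = (0.8, 1.3)       # de novo mutations per individual exome
OBSERVED_DE_NOVO = 38


def expected_de_novo(patients: int, fraction: float, rate: float) -> float:
    """Expected background de novo variants in the screened sequence."""
    return patients * fraction * rate


def poisson_upper_tail(k: int, mean: float, terms: int = 200) -> float:
    """P(X >= k) for X ~ Poisson(mean), summed directly to avoid cancellation."""
    return sum(
        exp(-mean + i * log(mean) - lgamma(i + 1))
        for i in range(k, k + terms)
    )


low, high = (expected_de_novo(PATIENTS, EXOME_FRACTION, r) for r in RATE_PER_EXOME)
print(f"expected by chance: {low:.1f} to {high:.1f}")
print(f"P(>= {OBSERVED_DE_NOVO} | mean {high:.1f}) = "
      f"{poisson_upper_tail(OBSERVED_DE_NOVO, high):.2e}")
```

The tail is evaluated at the upper end of the expected range, which is the conservative choice: any lower background rate would make the observed excess even less likely.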
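The enrichment test can be sketched as a one-sided exact binomial test against the background proportion of autosomal dominant genes in the panel (155/646). The text reports only the percentage, so the count used below (19 autosomal dominant diagnoses among 29 diagnosed patients, about 65%) is a hypothetical input chosen for illustration, and the one-sided form of the test is likewise an assumption.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float) -> float:
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


BACKGROUND_AD_FRACTION = 155 / 646   # from the text

# Hypothetical counts: the text gives "65%" without the underlying numbers.
ASSUMED_AD_DIAGNOSES = 19
ASSUMED_TOTAL_DIAGNOSES = 29

p_value = binomial_upper_tail(
    ASSUMED_AD_DIAGNOSES, ASSUMED_TOTAL_DIAGNOSES, BACKGROUND_AD_FRACTION
)
print(f"one-sided p = {p_value:.2e}")
```

The null hypothesis here is that each diagnosis falls in an autosomal dominant gene with a probability equal to the share of such genes in the panel, which ignores differences in gene size and mutation rate.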
Acknowledgments
The authors would like to thank the Exome Aggregation Consortium and the groups that provided exome variant data for comparison. A full list of contributing groups can be found at http://exac.broadinstitute.org/about. Finally, we warmly thank all patients and their families for their participation in this study.
Footnotes
Contributors Study concept and design: FM, SoM, CO. Clinical genetics investigations: FM, MR, CO. Acquisition, analysis and interpretation of data: FM, AC-L, MR, SO, SoM, SaM, CO. Drafting of the manuscript: FM. Critical revision of the manuscript for important intellectual content: AC-L, MR, SO, SoM, SaM, CO. Obtained funding: FM, CO, MR. Administrative, technical or material support: AC-L, MR, SO, SoM, SaM, CO. Study supervision: FM.
Funding This study was supported by grant PI14/00350 (Instituto de Salud Carlos III -Acción Estratégica en Salud 2013–2016; FEDER -Fondo Europeo de Desarrollo Regional) and Fundación Alicia Koplowitz (CO).
Competing interests None declared.
Patient consent Obtained.
Ethics approval Comité Ético de Investigación Biomédica del Hospital Universitario y Politécnico la Fe.
Provenance and peer review Not commissioned; externally peer reviewed.