Article Text
Abstract
Objectives Recently, several studies documented that de novo mutations (DNMs) play important roles in the aetiology of sporadic diseases. Next-generation sequencing (NGS) enables variant calling at single-base resolution on a genome-wide scale. However, accurate identification of DNMs from NGS data still remains a major challenge. We developed mirTrios, a web server, to accurately detect DNMs and rare inherited mutations from NGS data in sporadic diseases.
Methods The expectation-maximisation (EM) model was adopted to accurately identify DNMs from variant call files of a trio generated by GATK (Genome Analysis Toolkit). The GATK results, which contain certain basic properties (such as PL, PRT and PART), are iteratively integrated into the EM model to strike a threshold for DNMs detection. Training sets of true and false positive DNMs in the EM model were built from whole genome sequencing data of 64 trios.
Results With our in-house whole exome sequencing datasets from 20 trios, mirTrios totally identified 27 DNMs in the coding region, 25 of which (92.6%) are validated as true positives. In addition, to facilitate the interpretation of diverse mutations, mirTrios can also be employed in the identification of rare inherited mutations. Embedded with abundant annotation of DNMs and rare inherited mutations, mirTrios also supports known diagnostic variants and causative gene identification, as well as the prioritisation of novel and promising candidate genes.
Conclusions mirTrios provides an intuitive interface for the general geneticist and clinician, and can be widely used for detection of DNMs and rare inherited mutations, and annotation in sporadic diseases. mirTrios is freely available at http://centre.bioinformatics.zj.cn/mirTrios/.
- mirTrios
- de novo mutations
- rare inherited mutations
- sporadic diseases
Statistics from Altmetric.com
Introduction
De novo mutations (DNMs), arising from meiosis of the gametes of the parents (ie, sperm and egg) and transmitted to their child, usually have severe biological or phenotypic consequences when they affect functionally important nucleotides in the genome.1 DNMs represent the most extreme form of rare genetic mutation and make these mutations prime candidates for causing sporadic genetic diseases that remain in a population despite the reduced fecundity.2 ,3 The widespread availability of next-generation sequencing (NGS), such as whole exome sequencing (WES) and whole genome sequencing (WGS), revolutionised the identification of DNMs on a genome-wide scale. Attention has been mostly focused on neuropsychiatric diseases,1–5 such as autism spectrum disorders (ASDs), schizophrenia, intellectual disability, and epileptic encephalopathy. These studies serve as pioneers, and many more large scale studies of other genetic diseases (such as congenital heart disease6) by NGS to identify risk-associated DNMs are underway.5 ,7
With the development of NGS, a number of computational methods that address multi-sample (typically parent–offspring trios) variant detection and genotype calling have been developed, such as SAMtools,8 GATK (Genome Analysis Toolkit),9 TrioCaller,10 VarScan,11 Famseq12 and VariantMaster.13 Among them, FamSeq builds on Bayesian networks to provide the probability for each genotype of each variant using data from all familial members. These methods greatly increase the power of inferring genotypes and haplotypes, but if we directly apply these methods for DNM calling, the false discovery rate will be above 60%.14 The potential error during PCR, sequencing and mapping may contribute to the false positive rate. In some cases, assumed DNMs are actually inherited mutations due to the low evenness in local genomic regions of multiple samples. Subsequently, PolyMutt,15 DeNovoGear16 and DNMFilter17 were specifically developed for DNM detection from trio-based NGS. PolyMutt and DeNovoGear investigate all available family members jointly based on likelihood framework and likelihood-based error modelling, respectively. Both algorithms relied on the average mutation rate of each class of mutations across the given genome, while de novo mutation rates were found to vary strikingly across different genomes and regions.18 DNMFilter is based on a machine-learning filtering approach to identify DNMs, the efficacy of which is sensitive to the training set. Recently, Scalpel was specifically developed to detect de novo and transmitted insertions and deletions (indels) in exome-capture data on the basis of localised assembly.19 However, all the above software require a certain level of computational skills that can handle installing, minor processing of input raw data or even debugging when incompatibility of datasets occurs. There are still no public user-friendly online services available for comprehensive analysis from family-based NGS data in sporadic diseases. In this study, mirTrios, a web server implementing the expectation–maximisation (EM) algorithm, was developed to accurately identify DNMs from trio-based or family-based variant call file (VCF) results from NGS in sporadic diseases.
Studies have revealed that rare inherited variants, existing in homozygous, hemizygous, compound heterozygous, or dominant heterozygous forms, also make substantial contributions to sporadic diseases.20–23 Thus, the identification of rare inherited mutations and the annotation of them are also provided in mirTrios. More importantly, the application of available online tools for identification of candidate genes in sporadic diseases is still insufficient. For analysis of multiple families, an adjusted TADA (Transmission And De novo Association) model24 was used to prioritise candidate genes and provide a p value for statistical evidence in sporadic genetic diseases on the basis of extensive annotation.
Methods
Accurate model for identification of DNMs
A generic VCF format file generated by GATK containing variant information of trios is required for the detection of DNMs. Due to the errors that occurred during sequencing, mapping and the variant calling process, discovering them simply by filtering based on the allowed scope of parameters, such as depth, quality and genotype, may not be sufficient to downsize false positive variants. Therefore, an EM algorithm adopted by mirTrios is used to further extract potential DNMs with closely related properties available from the VCF file (figure 1). These properties were iteratively integrated into the EM algorithm to strike a threshold for the identification of DNMs. The EM algorithm encompasses two major iterative steps:
(1) Expectation step (E step), calculating log-likelihood function on the basis of initial parameters or iterative results yielded in previous steps (the initial values were determined on the basis of a large amount of training data):
In this formula, represents the log-likelihood of variable X in the Gaussian mixture distribution Z with different iterative process θ. In addition, n denotes the total number of Gaussian distribution, and πi denotes the weight of Gaussian distribution N in the iterative progresses. Every Gaussian mixture distribution zi has a variable of xi, a mean value of and a variable of .
Expectation of the conditional distribution :
(2) Maximisation step (M step): new parameters are generated by maximising the log-likelihood function, replacing with to obtain a maximised expectation . In these steps, represents the previous iterative process, and represents the current iterative process. Z represents Gaussian mixture distribution, and and represents the mean value and square deviation, respectively.
Generally, both the number of DNMs and non-DNMs from the large amount of trio samples present normal distributions (Gaussian distribution, Kolmogorov-Smirnov test, p<0.001), and jointly demonstrates a Gaussian mixture distribution. Based on the sample of Gaussian mixture distribution, we adopted the above described EM model to distinguish DNMs and non-DNMs, resulting in the probability:
In the formula, , and ; n denotes the total number of normal distribution; πi denotes the weighting coefficient of each normal distribution represented by . The variable xi is distributed normally with a mean value of and a variance of .
All the DNM-related properties from VCF results generated by GATK are integrated into an EM model, which will be applied iteratively to strike a threshold for each variable that is essential for detection of DNMs. Several properties, QUAL (quality of alignment), Depth (total sequencing depth), QD (variant confidence/quality by depth), MQ0 (number of reads with mapping quality equal to 0), PL (the maximum Phred-scaled likelihoods for genotypes in either parents or child), BT (depth of child/depth of parents), PRT (the maximum percent of the covered reads in proband with reference calls), and PART (the minimum percent of the covered reads in parents with reference calls) are closely relevant to DNMs. Among these properties, PL, PRT and PART are related to family information while the rest are independent from each other. Family information is crucially important to the determination of DNMs, so we also took PL, PRT and PART into inferential account. For those related individuals, we adopted the Bayesian framework to classify them:
In the formula,
C refers to proband, F to father and M to mother. denote the probability of concurrence of maternal homozygous and proband heterozygous, paternal homozygous and proband heterozygous, respectively. The probability of the proband being heterozygous and the parents being homozygous, which is required for the accurate detection of DNMs, can be obtained by this Bayesian framework.
We used the DNMs validated by Sanger sequencing to build the training set for our EM model. The training set containing both true positive and negative DNMs were extracted from previously published 32 ASD trios25 and WGS datasets of our in-house, unpublished, 32 other ASD trios. These data were used to generate the initial values in the EM module (such as n, πi, and the variance) for each of the DNM related properties sourced from VCF results. In particular, n refers to the two different Gaussian distributions, DNMs and non-DNMs; refers to the mean value of each of the properties (such as QD, MQ0, PL, etc) in the training data. In addition, the initial weight πi was assigned equally at the first time of the iterative process.
Detection of rare inherited mutations
mirTrios identifies inherited mutations directly from trio-based VCF outputs generated by GATK based on the Phred-scaled probability score and reads depth with related high sensitivity and accuracy.9 ,10 The inherited models of mutations were classified into four types according to the genotypes: homozygous or compound heterozygous mutation (Hom), X-linked hemizygous mutation in male (Hem), transmitted heterozygous mutation (THet), and non-transmitted heterozygous mutation (NHet). The four inherited models cause disruption of genes at different levels. Hom affects all copies of genes; Hem disrupts the only copy of genes on the X chromosome in males; while THet and NHet implicate only one copy of genes in the proband and parents, respectively. In addition, mirTrios removed all common mutations by user defined frequency threshold in dbSNP137, ESP6500,26 and 1000 Genomes (released in April 2012)27 (figure 1).
Annotation of variants
mirTrios employs ANNOVAR28 to annotate DNMs and rare inherited mutations with RefSeq (hg19, from UCSC). The annotation information of mutations contains the locations in different genomic regions (exonic, intronic, splicing, intergenic, etc) and the effects on protein coding in coding region (stopgain, frameshift, synonymous, missense, etc). Loss-of-function (LoF) mutations (stopgain, stoploss and splicing single nucleotide variants (SNVs) and frameshift indels) were directly used to prioritise disease candidate genes. Moreover, genes harbouring only synonymous SNVs or non-frameshift indels which were less likely to contribute to disease were eliminated from our candidate list. For non-synonymous SNVs, though many methods or tools have been developed to predict the degree of damages based on evolutionary conservation and functional disruption, all of them have inevitable limitations and biases. A proposed solution for this is to use consensus prediction or majority vote of many methods.29 ,30 To this end, mirTrios integrates 12 methods for functional prediction, namely SIFT (Sorting Intolerant from Tolerant),31 Polyphen2_hvar,32 Polyphen2_hdiv,32 MutationTaster,33 MutationAssessor,34 LRT,35 FATHMM (Functional Analysis through Hidden Markov Models),36 GERP++ (Genomic Evolutionary Rate Profiling),37 PhyloP,38 SiPhy,39 ,40 RadialSVM and MetaLR in dbNSFP.29 ,30 Users can define which of these 12 methods to be used to set pathogenicity thresholds (figure 1).
Prioritisation of candidate genes
Since both de novo and rare inherited mutations are strongly associated with sporadic diseases,20–23 integrating both of them can be a highly effective way to prioritise candidate genes. TADA incorporates de novo mutations and rare transmitted/non-transmitted heterozygous mutations and adopts parameters for allele frequencies and gene-specific penetrance for risk gene identification.24 However, LoF/damaging homozygous, compound heterozygous and hemizygous mutations are not taken into account in the primary TADA model. To enrich the prediction model, mirTrios made minor adjustments to the TADA model,24 and serves to make more accurate predictions of candidate genes, assuming that the effects of those three mutations are equal. The slightly adjusted TADA programme was used to calculate the p value of each gene harbouring rare LoF or damaging mutations (ie, extreme mutations) with statistical support (figure 1).
Non-coding region analysis
Currently, an increasing number of studies have demonstrated that the non-coding regions play important roles in gene regulation, RNA processing, and biological networks.41 Mutations in the non-coding region have been demonstrated to be associated with many diseases. Therefore, mirTrios supplies de novo and rare inherited mutation annotation in non-coding elements by FunSeq41 to discover candidate disease drivers with the selected annotation information integrated in this tool. These selected non-coding regions were classified into six functional categories including ENCODE annotation, sensitive region, ultrasensitive region, known transcription factor motif, promoter or enhancer of target genes, and hub of target. The de novo and rare inherited mutations will be assigned a score ranging from 0 to 6, corresponding to its location at different regions, to prioritise non-coding variants (figure 1).
Results
Assessment of identification of DNMs
We tested the performance of mirTrios with WES datasets of our in-house 20 case–parent trios with sporadic ASDs or high myopia generated by WES. We jointly used mirTrios, PolyMutt, DeNovoGear, DNMFilter, and Triodenovo (http://genome.sph.umich.edu/wiki/Triodenovo) to identify DNMs and totally generate 45 predicted DNMs in coding regions, 27 of which are true positive validated by Sanger sequencing (figure 2A, see online supplementary materials and methods, supplementary table S1). We also compared the accuracy of single-sample calling and multi-sample calling by GATK. Single-sample calling identified 31 more predicted DNMs, but none of them are true positive (figure 2B). By contrast, multi-sample calling has a higher specificity (50% vs 35.5%), yet is still lower than other methods, which suggests that multi-sample calling greatly increases the power of inferring genotypes and haplotypes. Despite a 96.3% sensitivity, the low specificity is a remaining problem for detection of DNMs. Therefore a specialised tool for DNM calling is required. mirTrios detected 27 putative DNMs, 25 of which are true positive, presenting higher sensitivity and specificity (both 92.6%) than PolyMutt, denovoGear and DNMFilter. Triodenovo has a somewhat higher sensitivity (96.3%), but lower specificity (89.7%) than mirTrios. Moreover, mirTrios provides a web-based interface for DNMs and rare inherited mutation detection and candidate gene prioritisation (figure 2B). For the 27 true positive DNMs, all of them were detected by at least two tools and 17 were in the intersection of all four tools. In addition, for the other negative calls, most of them are detected only by one tool. These results indicate that mirTrios achieved a relatively high sensitivity and specificity for detection of DNMs (figure 2B).
To provide a guidance for users and define the optimal parameter values for DNM detection by mirTrios, we generated a large amount of simulated data and compared the detection results using a range of parameters (see online supplementary materials and methods). Results showed that some parameters do have an effect on the specificity and sensitivity of DNM detection (see online supplementary figure S1). Based on our simulated data, mirTrios provide an optimal value for each parameter with both high specificity and sensitivity (see online supplementary materials and methods).
Data inputs
In order to facilitate the use of our tools for clinicians lacking sufficient bioinformatics skills, mirTrios provides an intuitive interface to allow user-defined options to customise detection and annotation of de novo and rare inherited mutations generated by trios-based NGS in sporadic disease (figure 2B). Based on these detected mutations and extensive annotation, mirTrios also supports prioritisation of candidate genes. A VCF format file (V.4) generated by GATK and a family list file containing the genetic relationship in each nuclear family are required for mirTrios input (figure 3A). To reduce the input size, all input VCF format files can be compressed into .tar, .gz, .tar.gz, or .tar.bz2 formats. mirTrios allows users to effectively upload the VCF files via the web page or file transfer protocol (FTP) server. After successfully uploading the data, users could start analysis with customised parameters by which the efficiency of the detection of DNMs and rare inherited mutations and annotations could be effectively managed. More importantly, this flexible customisation enables users to re-analyse uploaded data independently through different combinations of parameters.
To make mirTrios more convenient, the mirTrios stand-alone version supports BAM files as inputs. Public users can download this freely available stand-alone program from the mirTrios website. Since the size of a BAM file is generally 100-fold greater than that of a VCF file (eg, the size of BAM and VCF files of an exome are 5 GB and 50 MB, respectively), which is a stumbling block for uploading, the web server of mirTrios will only support VCF files as input. However, it is noted that mirTrios is specifically developed for comprehensive analysis of sporadic diseases instead of familiar diseases, such as three generation families. It is considered that familiar diseases are generally used to identify rare inherited variations, which are supposed to segregate with disease, rather than DNMs. Therefore, the current version of mirTrios only works on nuclear families with multiple probands and/or siblings and their unaffected parents.
Data outputs
The analytical results can be retrieved and browsed by a unique identifier which is generated immediately after the data are uploaded successfully (figure 3A). A typical output includes four sections: DNMs and annotation; rare inherited mutations and annotation; disease candidate genes; and non-coding region analysis (figure 3B–E). These four sections are well organised to demonstrate the results of each part. The first section illustrates all the detected DNMs and annotations of them, including mutation loci (exonic, splicing, 5′UTR, upstream, etc), mutational type (SNV, insertion and deletion), and effects on coding region (stoploss, stopgain, non-synonymous, synonymous, frameshift, etc), as well as the annotation in various public databases, such as dbSNP138, ESP6500,26 and 1000 Genomes.27 For non-synonymous SNVs, mirTrios provides a predicted pathogenicity score based on 12 methods, which can be modified electively. More importantly, mirTrios also supports the detection of known diagnostic mutations and disease-related genes based on five resources: OMIM (Online Mendelian Inheritance in Man),42 MGI,43 HGMD (Human Gene Mutation Database),44 COSMIC (Catalogue of Somatic Mutations in Cancer),45 and ClinVar.46 This is powerful for the identification of known functional mutations and novel candidate genes. The second section showed all classes of detected rare inherited mutations (homozygous or compound heterozygous, X-linked hemizygous, and transmitted/non-transmitted heterozygous mutations) with detailed annotation similar to DNMs. The disease candidate genes section displays all the potential disease-associated genes, which contain at least one extreme mutation (damaging de novo or rare inherited mutation) with a given p value. In this section, mirTrios clearly provide the count of LoF/damaging DNMs, and transmitted/non-transmitted rare inherited mutations in each gene. The optional non-coding region analysis results will be generated if users provide the whole genome trios sequencing data. In this section, mirTrios shows all detected de novo and rare inherited mutations located in the functional non-coding region. Based on the sequence location, mirTrios provides a score ranging from 0 (less deleterious) to 6 (more deleterious) to estimate the deleterious effect of variations.
Discussion
The rapid advances of WES/WGS technologies have greatly facilitated clinical genetic diagnosis genome-wide.47 ,48 For sporadic disease, despite the minor role of common mutations or the environment, LoF/damaging DNMs is an important source of causality.49 In addition, rare inherited mutation also contributes to the risk of sporadic disease, such as ASD22 and schizophrenia.50 The vast amount of mutations generated by NGS poses multiple challenges for the identification of functional mutations and candidate genes.
At present, although a few tools have been developed to detect DNMs or candidate genes by NGS, there are still no public online services available for comprehensive analysis of trios-based NGS data. Therefore, we have developed a novel and comprehensive platform, mirTrios, for the analysis of trio-based WES/WGS VCF results, which allows accurate detection and annotation of DNMs and rare inherited mutation in coding and non-coding regions. For the average geneticist and clinician, the integrated framework of mirTrios avoids the cumbersome process of complex installation, redundant operations, and requirement for high-performance computational capability. For multiple trios analysis, mirTrios also provides an integrated framework for known diagnostic variant identification and candidate gene prioritisation based on the detected de novo and rare inherited mutations from the large amount of data generated by NGS. In essence, mirTrios provides comprehensive and meaningful data for users to study in depth the genetic basis of sporadic diseases.
mirTrios provides an intuitive interface for users to upload files directly by web page or ftp address, which can be widely used by researchers to explore the functional mutation and candidate genes in sporadic disease. mirTrios is freely available for non-commercial use and will be updated regularly to keep up with the latest resources of the implemented databases. Restricted by the lack of a sophisticated algorithm for detecting de novo CNV and SV (structural variation), mirTrios currently only provides point mutation analysis. In this aspect, mirTrio will be updated with state-of-art de novo CNV/SV detection and integrate these tools with optimal accuracy and specificity. We believe mirTrios will be very helpful for the study of sporadic disease.
Implementation
mirTrios is freely available at http://centre.bioinformatics.zj.cn/mirTrios/. Documentation and example data can be found on the website. The web client of mirTrios was implemented independently and has been successfully tested with different releases of Microsoft Internet Explorer 11.0, Firefox 30.0, Google Chrome 35.0, and Safari 5.1 (under different versions of MacOS, Microsoft Windows and Linux). mirTrios was constructed under an Apache/PHP/MySQL environment on the Red Hat Enterprise 5.5 Linux operating system. The uploaded VCF data will be analysed on our five computational nodes, with 16 CPUs and 32 GB of RAM in each node.
Acknowledgments
The authors thank Dr Yong-hui Jiang and Dr Ming-bang Wang for helpful training data support.
References
Supplementary materials
Supplementary Data
This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.
Files in this Data Supplement:
- Data supplement 1 - Online supplement
- Data supplement 2 - Online supplement
Footnotes
-
JL and YJ contributed equally.
-
Contributors JW, ZSS and KX: designed and supervised the study. JL, YJ, TW, and HC drafted the manuscript. JL, YJ, TW, QS, and XR: developed the web server. YJ, TW, QX, and JL: developed the method to detect DNMs. JL: implemented the method to detect rare inherited mutations and prioritisation of candidate genes. All the authors read and approved the manuscript.
-
Funding The project was funded by the National Basic Research Program of China (No. 2012CB517902 and 2012CB517904), the National "12th Five-Year" scientific and technological support projects (No. 2012BAI03B02), and the Special Funds of National Health and Family Planning Commission of China (No. 201302002).
-
Competing interests None.
-
Patient consent Obtained.
-
Provenance and peer review Not commissioned; externally peer reviewed.
-
Data sharing statement Datasets used in the manuscript are available on the mirTrios web server.