Background High-throughput DNA sequencing platforms have become widely available. As a result, personal genomes are increasingly being sequenced in research and clinical settings. However, the resulting massive amounts of variant data pose significant challenges to biologists and clinicians without bioinformatics skills.
Methods and results We developed a web server called wANNOVAR to address the critical need for functional annotation of genetic variants from personal genomes. The server provides a simple and intuitive interface to help users determine the functional significance of variants. Its functions include annotating single nucleotide variants and insertions/deletions for their effects on genes, reporting their conservation levels (such as PhyloP and GERP++ scores), calculating their predicted functional importance scores (such as SIFT and PolyPhen scores), retrieving allele frequencies in public databases (such as the 1000 Genomes Project and NHLBI-ESP 5400 exomes), and implementing a ‘variants reduction’ protocol to identify a subset of potentially deleterious variants/genes. We illustrated how wANNOVAR can help draw biological insights from sequencing data by analysing genetic variants from data sets on two Mendelian diseases.
Conclusions We conclude that wANNOVAR will help biologists and clinicians take advantage of the personal genome information to expedite scientific discoveries. The wANNOVAR server is available at http://wannovar.usc.edu, and will be continuously updated to reflect the latest annotation information.
- Mendelian diseases
- genetic variation
- functional annotation
- disease mutation
Over the past 5 years, massively parallel DNA sequencing platforms have become widely available.1 As a result, variant data on genomes from healthy subjects and patients are being generated at an unprecedented rate. However, the development of bioinformatics tools for handling these data lags behind, creating a gap between the generation of massive data and the ability to fully exploit their biological content. To meet this urgent demand, we previously developed the ANNOVAR (ANNOtate VARiation) software for functional annotation of genetic variants from sequence data.2 ANNOVAR efficiently uses up-to-date information to annotate genetic variants detected from diverse genomes with user-specified versions of genome builds. Although ANNOVAR has become one of the most widely used annotation tools for sequencing data, the requirement to type command line arguments makes it inaccessible to many biologists and clinicians who would otherwise benefit from its extensive functionality.
Therefore, we developed a web server called wANNOVAR to facilitate web-based personal genome annotation, using ANNOVAR as the back-end annotation engine. Users simply need to submit a list of variants (even whole-exome or whole-genome variants), and wANNOVAR processes the submission and generates HTML-based result pages. It allows flexibility by permitting users to select customised filtering criteria and identify a subset of prioritised variants from thousands or even millions of input variants. Below, we describe the implementation of the wANNOVAR server and illustrate its utility using two high-throughput sequencing data sets on Mendelian diseases.
The web server is composed of a web interface and a background program for executing annotation tasks. Our tests indicated that the server performed well under a light load of user queries. For example, annotating an exome with ∼20 000 SNPs and indels takes only a few minutes on the server. The subroutines for handling user queries were written in Perl and were facilitated by the Common Gateway Interface module (CGI.pm). The static and dynamic HTML pages have been tested in different versions of the Internet Explorer, Firefox and Google Chrome browsers.
Input fields for the wANNOVAR server include a sample identifier, an email address, a variant file, the reference genome build, the gene definition system and optionally a disease model for running the ‘variants reduction’ pipeline. The default input format for the variant file is variant call format (VCF),3 which is a text file that contains meta-information lines, a header line, and data lines, each containing information about a position in the genome. The server can also handle other input formats, including the ANNOVAR input format, the Complete Genomics ASM.tsv format and the GFF3-SOLiD format. Currently, the input file size is restricted to less than 200 MB, and the input file can be compressed in .gz or .zip format. If all input fields are correctly set, the server will return a webpage with a URL for the results page.
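The three-part layout of a VCF file described above can be illustrated with a minimal Python sketch. This is not the parser used by wANNOVAR or ANNOVAR; the function name `parse_vcf` and the two example records (positions and alleles) are made up for illustration, and only the fixed leading VCF columns are shown.

```python
import io

def parse_vcf(lines):
    """Split VCF text into meta-information lines, header columns and data records."""
    meta, header, records = [], None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information lines start with two hash signs.
            meta.append(line)
        elif line.startswith("#"):
            # The single header line starts with one hash sign.
            header = line.lstrip("#").split("\t")
        else:
            if header is None:
                raise ValueError("data line found before the header line")
            # Each data line describes one position in the genome.
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records

# Hypothetical two-record file (values invented for illustration).
EXAMPLE = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "X\t1000\t.\tT\tC\t50\tPASS\t.\n"
    "X\t2000\t.\tA\tAT\t60\tPASS\t.\n"
)

meta, header, records = parse_vcf(io.StringIO(EXAMPLE))
# A variant is treated here as an indel when REF and ALT differ in length.
indels = [r for r in records if len(r["REF"]) != len(r["ALT"])]
print(len(meta), header[:5], len(records), len(indels))
```

Real VCF files also carry genotype columns and multi-allelic ALT fields, which this sketch deliberately ignores.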
The results page contains a collection of functional annotations for variant calls. Users can download the ‘exome summary results’ or the ‘genome summary results’ as Excel-compatible files or tab-delimited files, or choose to view the annotation results in a table on the webpage. The annotations on all variants are grouped into several broad categories including gene annotation, variation databases, functional prediction and region annotations (table 1). Several functional prediction scores for exonic variants from the dbNSFP Database,4 including SIFT,5 PolyPhen,6 LRT,7 MutationTaster8 and PhyloP,9 are also provided in the wANNOVAR server to help users judge the functionality of variants using multiple sources of information. As previously described, wANNOVAR can perform a ‘variants reduction’ procedure to identify a subset of the most likely causal variants/genes for Mendelian diseases from a large list of variants on personal genomes.2 For example, users can remove variants observed in public databases such as the 1000 Genomes Project,10 NHLBI-ESP 5400 exomes11 and dbSNP12 using a specified minor allele frequency cut-off. The server uses modified versions of dbSNP that exclude all SNPs flagged as ‘clinically associated’ by dbSNP. We provide several default pipelines for different disease models such as ‘rare recessive Mendelian disease’ and ‘rare dominant Mendelian disease’, but users can also use ‘advanced options’ to specify a custom filtering strategy (table 2).
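The frequency-based step of the ‘variants reduction’ procedure can be sketched in Python as follows. This is an illustrative sketch only, not the server's implementation: the cut-off of 0.01, the function name `is_rare`, the database keys and the three variant records are all assumptions invented for the example.

```python
# Hypothetical cut-off; on the server the user chooses this value.
MAF_CUTOFF = 0.01

def is_rare(variant, cutoff=MAF_CUTOFF):
    """Keep a variant unless some database reports a frequency above the cut-off.

    A frequency of None means the variant is absent from that database.
    """
    return all(
        freq is None or freq <= cutoff
        for freq in variant["maf"].values()
    )

# Made-up records for illustration only (not real annotations).
variants = [
    {"id": "var1", "maf": {"1000g": None, "esp5400": None}},
    {"id": "var2", "maf": {"1000g": 0.20, "esp5400": 0.18}},
    {"id": "var3", "maf": {"1000g": 0.004, "esp5400": None}},
]

rare = [v["id"] for v in variants if is_rare(v)]
print(rare)
```

In this sketch a variant absent from every database is kept, which matches the idea that a rare Mendelian disease variant is unlikely to be common in public reference panels.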
Analysis of a real exome sequencing data set on Ogden syndrome
To demonstrate the utility of the wANNOVAR server, we analysed variant calls from a family segregating Ogden syndrome ([MIM: 300855]). Thirty years ago, Ogden syndrome was discovered as an X-linked lethal infantile disorder, and its genetic basis was recently solved by next-generation sequencing.13 The disease is characterised by postnatal growth failure with severe delays and dysmorphic features, and is caused by a mutation in the NAA10 gene, leading to an N-terminal acetyltransferase deficiency. For the family with Ogden syndrome, exon-capture sequencing data were aligned by BWA14 and genotypes were called by GATK15 as VCF3 files in hg19 coordinates. We submitted all chromosome X variants (1318 single nucleotide variants and 161 indels) in the proband to the wANNOVAR server, and tested the ‘variants reduction’ procedure using the default ‘rare recessive Mendelian disease’ pipeline and a custom pipeline (table 2). Compared with the default pipeline, the custom pipeline additionally filtered the variant set against the two unaffected family members and identified deleterious variants using SIFT/PolyPhen scores. Both pipelines identified a hemizygous mutation (p.S37P) within a single candidate gene, NAA10, and this was precisely the known causal variant in this family.13 Detailed examination of the ‘exome summary’ table demonstrated that this variant has a SIFT5 score of 0 (prediction: damaging), PolyPhen6 score of 0.96 (prediction: probably damaging), LRT7 score of 1 (prediction: deleterious), MutationTaster8 score of 1 (prediction: disease causing), PhyloP9 score of 0.96 (prediction: conserved) and GERP++16 score of 3.55 (prediction: highly constrained). The variant is not observed in the 1000 Genomes Project,10 dbSNP12 version 135 (after removing SNPs flagged as ‘clinically associated’) or the NHLBI-ESP 5400 exomes.11 Therefore, converging bioinformatics evidence supports that this variant may affect protein function.
Analysis of a synthetic whole-genome sequencing data set on Miller syndrome
We next evaluated wANNOVAR on millions of genetic variants from whole-genome sequencing. We used a synthetic data set of a male subject with ∼4.2 million single nucleotide variants and ∼0.5 million indels,17 supplemented with two variants (p.G152R and p.G202A) in DHODH known to cause Miller syndrome ([MIM: 263750]).18 This synthetic data set was previously used to illustrate the ‘variants reduction’ procedure.2 With the default ‘rare recessive Mendelian disease’ pipeline (table 2), the input variants were reduced to 516, and 24 candidate genes were identified including the causal gene DHODH. We also tested a custom pipeline that additionally identifies variants in conserved genomic regions19 and outside of segmental duplication regions20 (table 2). This custom pipeline identified ten candidate genes including DHODH, similar to what has been previously reported.2 Finally, we tested a different custom pipeline that additionally removes variants with SIFT score >0.05 and PolyPhen2 score <0.85. This custom pipeline identified 14 candidate genes (table 2), but DHODH was not among them because one of the mutations (p.G202A) was predicted as tolerated by SIFT (score =0.18) and benign by PolyPhen (score =0.69). However, we note that the variant was correctly predicted as deleterious by LRT, MutationTaster, PhyloP and GERP++. We caution that these algorithms present predictions that help users prioritise variants/genes, but the true sensitivity/specificity will depend on many factors, and none of the algorithms constitutes proof that a variant is disease causal. In summary, this example confirms the utility of the wANNOVAR server in identifying a prioritised list of candidate disease causal genes, while underscoring the need for judicious use of function prediction scores.
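The effect of the score-based filter on p.G202A can be reproduced with a small Python sketch. It applies the thresholds stated above (SIFT >0.05 and PolyPhen <0.85) to the two variants whose scores are reported in this article; reading ‘and’ as requiring both conditions is an assumption, and the function name `is_removed` is hypothetical rather than part of wANNOVAR.

```python
SIFT_CUTOFF = 0.05      # variants scoring above this are predicted tolerated
POLYPHEN_CUTOFF = 0.85  # variants scoring below this are not 'probably damaging'

def is_removed(sift, polyphen):
    """Return True when both scores suggest the variant is not deleterious.

    Assumption: the filter requires both conditions; a stricter reading would
    remove a variant when either condition holds.
    """
    return sift > SIFT_CUTOFF and polyphen < POLYPHEN_CUTOFF

# (SIFT, PolyPhen) scores as reported in the text for each variant.
scores = {
    "DHODH p.G202A": (0.18, 0.69),
    "NAA10 p.S37P": (0.0, 0.96),
}

kept = [name for name, (s, p) in scores.items() if not is_removed(s, p)]
print(kept)
```

Under either reading of the filter, p.G202A is discarded while p.S37P is retained, which illustrates why a true causal variant can be lost when prioritisation relies on only two prediction scores.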
In this manuscript, we presented a web server called wANNOVAR for performing web-based functional annotation of genetic variants from personal genomes. Below we compare the server with other competing approaches and discuss potential future extensions and development.
Several similar web servers exist, including SIFT,5 PolyPhen6 and the SeattleSeq server.21 The wANNOVAR server already incorporates SIFT and PolyPhen2 scores along with additional scoring systems (table 1).4 The wANNOVAR server differs from SeattleSeq in that: (1) it allows flexibility by permitting users to select gene definition systems, including RefSeq genes,22 ENSEMBL genes,23 UCSC genes24 or GENCODE genes.25 Compared with the manually compiled RefSeq gene definitions, ENSEMBL genes and UCSC genes are supplemented with computational predictions of transcripts and genes. The GENCODE genes are compiled by a combination of initial manual annotation and experimental validation by the GENCODE consortium, and a refinement of the annotation based on these experimental results. All four gene definition systems are widely used in human genomic studies; (2) wANNOVAR produces more annotation results, including predicted functional importance scores for non-synonymous variants; (3) wANNOVAR builds in a ‘variants reduction’ pipeline to facilitate identifying potential disease causal variants and genes from personal genomes.
The wANNOVAR server will be under constant development to improve its functionality. Our future plans include the following. First, we will explore the possibility of allowing FTP access for users with limited internet connection speed to upload files. Second, we will add more annotation tasks for non-coding variants, splicing variants and UTR variants. Currently, the available annotations are strongly biased towards non-synonymous variants. With the accumulation of cell-type specific data on functional elements from large-scale genomics projects, such as the ENCODE project,26 and the development of bioinformatics methods and databases,27–32 we will be able to provide more annotations for variants outside of coding regions. Third, we will test the use of a backend computing cluster rather than a frontend web server to perform the actual annotation tasks, in order to handle multiple simultaneous user queries. Fourth, we will explore the use of GALAXY33 and design a plug-in based on ANNOVAR for better annotating, processing and visualising variants.
In summary, wANNOVAR is an easy-to-use online tool for batch annotation of genetic variants. Given the rapid generation and accumulation of whole-exome or whole-genome sequencing data in research and clinical settings, we expect that wANNOVAR will help biologists and clinicians take advantage of personal genome information in various medical genetics applications.
The authors thank James Knowles, Jalas Chaim, Mingyao Li, Gholson Lyon and the three anonymous reviewers for testing the server and providing valuable feedback.
Funding The study is supported by start-up funds from the Zilkha Neurogenetic Institute and grant number HG006465 from NIH/NHGRI (K.W.).
Competing interests None.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement The two data sets used in the manuscript are available to users in the “Example” section of the wANNOVAR server.