Communications
WGSA: an annotation pipeline for human genome sequencing studies
  1. Xiaoming Liu1,2,
  2. Simon White3,
  3. Bo Peng4,
  4. Andrew D Johnson5,6,
  5. Jennifer A Brody7,
  6. Alexander H Li1,
  7. Zhuoyi Huang3,
  8. Andrew Carroll8,
  9. Peng Wei1,9,
  10. Richard Gibbs3,
  11. Robert J Klein10,
  12. Eric Boerwinkle1,2,3
  1. Human Genetics Center, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, USA
  2. Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, USA
  3. Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas, USA
  4. Department of Bioinformatics and Computational Biology, University of Texas MD Anderson Cancer Center, Houston, Texas, USA
  5. NHLBI Framingham Heart Study, Bethesda, Maryland, USA
  6. Population Sciences Branch, NHLBI Division of Intramural Research, Bethesda, Maryland, USA
  7. Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, Washington, USA
  8. DNAnexus, Mountain View, California, USA
  9. Department of Biostatistics, School of Public Health, University of Texas Health Science Center at Houston, Houston, Texas, USA
  10. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, Icahn Institute for Genomics and Multiscale Biology, New York, New York, USA

Correspondence to Dr Xiaoming Liu, University of Texas School of Public Health, Human Genetics Center, 1200 Herman Pressler Street, E529, Houston, TX 77030, USA; Xiaoming.Liu{at}uth.tmc.edu

DNA sequencing technologies continue to improve in throughput and quality while decreasing in cost. As we transition from whole exome capture sequencing to whole genome sequencing (WGS), our ability to convert machine-generated variant calls, including single nucleotide variants (SNVs) and insertion-deletion variants (indels), into human-interpretable knowledge has lagged far behind our ability to generate enormous numbers of variants. To help narrow this gap, we present WGSA (WGS annotator), a functional annotation pipeline for human genome sequencing studies, which runs out of the box on the Amazon Compute Cloud and is freely downloadable at https://sites.google.com/site/jpopgen/wgsa/.

Functional annotation is a key step in WGS analysis. On the one hand, annotation helps the analyst filter to a subset of elements of particular interest (eg, cell type specific enhancers); on the other hand, it helps investigators increase the power to identify phenotype-associated loci (eg, an association test using a functional prediction score as a weight) and interpret potentially interesting findings. Currently, there are several popular gene model based annotation tools, including ANNOVAR,1 SnpEff2 and the Ensembl Variant Effect Predictor (VEP).3 These can annotate a variety of protein coding and non-coding gene models from a range of species. It is well known among practitioners that different databases (eg, RefSeq4 and Ensembl5) use different models for the same gene. Even when the same gene structure is used, the consequences predicted for a given variant by different annotation tools may not agree.6 Therefore, it has been suggested that annotations be obtained from multiple tools across multiple databases for a more complete interpretation of the variants discovered in WGS.6 Annotations of coding and non-coding variants include scores pertaining to functionality, conservation and population allele frequencies, as well as disease-related annotations, that is, known disease-causing variants and disease-associated variants identified in genome-wide association analyses. Recent large-scale epigenomics projects provide rich data sets of cell-specific regulatory elements. Unfortunately, there are currently few tools that integrate all of these functional annotation resources and provide a convenient and efficient pipeline for annotating the millions of variants discovered in a WGS study.

To facilitate the functional annotation step of WGS, we developed WGSA. Currently WGSA supports the annotation of SNVs and indels locally without remote database requests, allowing it to scale up for large WGS studies. An overview of the WGSA pipeline is presented in figure 1. The complete list of the resources (and their references) contained in WGSA can be found in online supplementary table S1. For gene-model based annotation, WGSA integrates the outputs from three annotation tools (ANNOVAR, SnpEff and VEP) run against two databases (RefSeq and Ensembl), and provides a summary of variant consequences from the six annotation results. To further speed up the process for large-scale WGS studies, we have precomputed annotations for all potential human SNVs (a total of 8 584 031 106) based on the non-N bases of the human reference hg19 and use them as a local database. For SNV-centric resources, WGSA integrates five functional prediction scores, eight conservation scores, allele frequencies from four large-scale sequencing studies and variants in four disease-related databases, among others (figure 1 and online supplementary table S1). For regulatory region-centric resources, WGSA includes cell type specific transcription factor binding sites, DNase I hypersensitivity regions and chromosome activity predictions from three epigenomics projects (figure 1 and online supplementary table S1). WGSA also contains rich functional annotations for non-synonymous SNVs and genes from our dbNSFP database.7,8
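The idea of precomputing annotations for every potential SNV can be illustrated with a minimal sketch. It is not WGSA code (WGSA is written in Java); the function name `potential_snvs` and the toy sequence are hypothetical, and the sketch assumes that each non-N reference base contributes exactly three alternate alleles. Under that assumption, the reported total of 8 584 031 106 SNVs corresponds to 2 861 343 702 non-N bases in hg19.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def potential_snvs(chrom: str, sequence: str, start: int = 1) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, ref, alt) for every possible SNV at non-N bases.

    Hypothetical helper for illustration; positions are 1-based.
    """
    for offset, ref in enumerate(sequence.upper()):
        if ref not in BASES:
            # Skip N (and any other ambiguity code): no SNV is enumerated there.
            continue
        for alt in BASES:
            if alt != ref:
                yield (chrom, start + offset, ref, alt)


# Toy reference with two N bases: 4 non-N bases x 3 alternates = 12 SNVs.
toy_snvs = list(potential_snvs("chrT", "ACNNGT"))

# Assumption: three alternates per non-N base, so the total reported in the
# text divides evenly by three to give the implied number of non-N bases.
TOTAL_SNVS_REPORTED = 8_584_031_106
implied_non_n_bases = TOTAL_SNVS_REPORTED // 3
```

Because the set of potential SNVs is fixed for a given reference build, annotating them once and storing the results turns per-study annotation into a local lookup keyed by chromosome, position, reference allele and alternate allele.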

Figure 1

Flow chart of the whole genome sequencing annotator (WGSA) annotation pipeline. Dotted lines show the ‘detour’ of the indel annotation via the single nucleotide variant (SNV) annotation pipes.

Annotating the consequences of indels raises special challenges. In addition to reporting the allele frequencies and the consequences predicted by the three annotation tools (ANNOVAR, SnpEff and VEP), we first ‘translate’ an indel into local SNVs by incorporating its effect (insertion, deletion, replacement) on nearby flanking sequences, annotate those SNVs with the other available resources, and then summarise the results for the indel (figure 1, online supplementary figure S1 and details in online supplementary notes). Although this approach simplifies the potentially complicated impact of an indel, it provides additional information about the focal region in which the indel resides, such as whether it is a short tandem repeat mutation, whether it breaks a transcription factor binding site, local functional prediction scores, local conservation scores, etc.
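One way to picture the ‘translation’ step is sketched below. This is an illustrative simplification, not the WGSA algorithm, whose exact rules are given in the online supplementary notes. The function name `indel_to_local_snvs`, the VCF-style allele representation and the `flank` parameter are all assumptions made for this example: the indel is applied to a reference window, and every position near the indel where the altered sequence differs from the reference is reported as a local SNV.

```python
from typing import List, Tuple


def indel_to_local_snvs(
    window: str,        # reference sequence around the indel
    window_start: int,  # 1-based genomic position of window[0]
    pos: int,           # 1-based position of the first base of `ref`
    ref: str,           # reference allele (VCF style, assumed)
    alt: str,           # alternate allele (VCF style, assumed)
    flank: int = 2,     # extra bases compared past the alleles (assumed)
) -> List[Tuple[int, str, str]]:
    """Return (pos, ref_base, alt_base) for positions changed by the indel."""
    i = pos - window_start
    if i < 0 or window[i:i + len(ref)] != ref:
        raise ValueError("reference allele does not match the window")

    # Apply the insertion, deletion or replacement to the reference window.
    mutated = window[:i] + alt + window[i + len(ref):]

    # Compare reference and altered sequence base by base near the indel.
    span = max(len(ref), len(alt)) + flank
    end = min(len(window), len(mutated), i + span)
    return [
        (window_start + k, window[k], mutated[k])
        for k in range(i, end)
        if window[k] != mutated[k]
    ]


# Deleting G at position 103 shifts the downstream bases, giving three local SNVs.
deletion_snvs = indel_to_local_snvs("ACGTACGT", 101, 102, "CG", "C")

# Deleting one AC unit from an AC repeat leaves the local sequence unchanged.
repeat_snvs = indel_to_local_snvs("ACACACAC", 101, 101, "ACA", "A")
```

In this sketch the first call returns the SNVs at positions 103 to 105, each of which could then be looked up in SNV-centric resources, while the second returns an empty list because removing one repeat unit reproduces the same local sequence, which is the kind of signal that points to a short tandem repeat context.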

To provide convenient access for a broad community, we have built an Amazon Machine Image for running WGSA on the cloud via Amazon Web Services. To run WGSA, the user only needs to upload a variant list file (or a .vcf file) and a configuration file containing a list of the resources to be included in the annotation. The annotation pipeline can be run with just two command line calls. We also provide WGSA as a downloadable version at https://sites.google.com/site/jpopgen/wgsa for bioinformaticians who prefer to build WGSA locally. It can be used as a foundation resource for customised annotation pipelines, as is the case for Baylor College of Medicine's Human Genome Sequencing Center annotation software, Cassandra (https://www.hgsc.bcm.edu/software/cassandra). The run times of two experimental runs with 46.6 million variants are shown in online supplementary table S2. WGSA was written in Java so that most of its annotation modules (the whole SNV annotation and most of the indel annotation except VEP) can easily be run across different platforms. Detailed protocols for using WGSA in the cloud and building it locally can be found in online supplementary notes and at https://sites.google.com/site/jpopgen/wgsa/. As changes inevitably occur in the source databases and as new annotation resources emerge, WGSA will be updated in response. Users will be able to receive update notices and detailed instructions on updating steps (for the local version).

Acknowledgments

The authors thank the researchers of the CHARGE sequencing project, especially Drs L Adrienne Cupples and Fuli Yu and other members of the CHARGE Analysis & Bioinformatics Committee for their contributions. The authors thank Mike Dahdouli for providing technical support and Jin Yu for advising on Amazon Web Services usage.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors XL, ADJ, JAB, AHL, AC, PW, ZH, RJK and EB designed the study. XL collected the annotation resources and developed the tool. SW tested the pipeline. BP provided tools for retrieving the RegulomeDB data set. EB and RG supervised the study. XL, SW and EB wrote the draft manuscript and all authors provided critical edits.

  • Funding This study was supported by the US National Institutes of Health (5RC2HL102419 and U54HG003273).

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.
