Original article
SeqHBase: a big data toolset for family based sequencing data analysis
  1. Min He (1,2,3)
  2. Thomas N Person (1)
  3. Scott J Hebbring (1,3)
  4. Ethan Heinzen (4)
  5. Zhan Ye (2)
  6. Steven J Schrodi (1,3)
  7. Elizabeth W McPherson (5)
  8. Simon M Lin (6)
  9. Peggy L Peissig (2)
  10. Murray H Brilliant (1,3)
  11. Jason O'Rawe (7)
  12. Reid J Robison (8)
  13. Gholson J Lyon (7,8)
  14. Kai Wang (8,9,10)
  1. Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA
  2. Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA
  3. Department of Computation and Informatics in Biology and Medicine, University of Wisconsin-Madison, Madison, Wisconsin, USA
  4. College of Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, Minnesota, USA
  5. Department of Medical Genetics Services, Marshfield Clinic, Marshfield, Wisconsin, USA
  6. The Research Institute at Nationwide Children's Hospital, Columbus, Ohio, USA
  7. Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, Cold Spring Harbor, New York, USA
  8. Utah Foundation for Biomedical Research, Provo, Utah, USA
  9. Zilkha Neurogenetic Institute, University of Southern California, Los Angeles, California, USA
  10. Department of Psychiatry, University of Southern California, Los Angeles, California, USA
Correspondence to Dr. Min He, Center for Human Genetics, Marshfield Clinic Research Foundation, 1000 N Oak Ave, Marshfield, WI 54449, USA; he.max@marshfieldclinic.org and Dr. Kai Wang, Zilkha Neurogenetic Institute, University of Southern California, 1501 San Pablo St, Los Angeles, CA 90089, USA; kaiwang@usc.edu

Abstract

Background Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family based sequencing data analysis.

Methods Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation).

Results We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times scaled almost linearly with the number of data nodes. With 20 data nodes, SeqHBase took about 5 s to analyse WES familial data and approximately 1 min to analyse WGS familial data.

Conclusions These results demonstrate SeqHBase's high efficiency and scalability, both of which are necessary as WGS and WES rapidly become standard methods for studying the genetics of familial disorders.

  • whole-genome sequencing
  • whole-exome sequencing
  • big data
  • de novo mutations
  • inherited homozygous or compound heterozygous mutations
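
The three inheritance patterns named in the Methods can be illustrated for a single parent-child trio. The following is a minimal Python sketch, not SeqHBase's implementation: the `TrioCall` record, the function names and the genotype encoding (count of alternate alleles: 0, 1 or 2) are all assumptions made for illustration. It also assumes every genotype is reliable, whereas SeqHBase additionally uses per-site coverage from BAM files, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class TrioCall:
    """One variant site in one trio (hypothetical record layout).

    Genotypes are alternate-allele counts: 0 = homozygous reference,
    1 = heterozygous, 2 = homozygous alternate.
    """
    gene: str
    variant_id: str
    child: int
    father: int
    mother: int


def is_de_novo(call: TrioCall) -> bool:
    # The child carries the variant and neither parent does.
    return call.child >= 1 and call.father == 0 and call.mother == 0


def is_inherited_homozygous(call: TrioCall) -> bool:
    # The child is homozygous alternate and both parents are heterozygous
    # carriers (a simple recessive model; an assumption of this sketch).
    return call.child == 2 and call.father == 1 and call.mother == 1


def compound_het_genes(calls: Iterable[TrioCall]) -> Dict[str, List[str]]:
    """Return genes in which the child is heterozygous for at least one
    variant inherited only from the father and at least one inherited
    only from the mother, mapped to the contributing variant IDs."""
    paternal: Dict[str, List[str]] = {}
    maternal: Dict[str, List[str]] = {}
    for call in calls:
        if call.child != 1:
            continue
        if call.father == 1 and call.mother == 0:
            paternal.setdefault(call.gene, []).append(call.variant_id)
        elif call.mother == 1 and call.father == 0:
            maternal.setdefault(call.gene, []).append(call.variant_id)
    return {
        gene: paternal[gene] + maternal[gene]
        for gene in paternal
        if gene in maternal
    }
```

In this sketch, de novo and inherited homozygous candidates are decided one site at a time, while compound heterozygous candidates require grouping sites by gene, which is why the last function takes the whole collection of calls.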

This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
