Article Text

Download PDFPDF

Original article
Whole-genome sequencing analysis of CNV using low-coverage and paired-end strategies is efficient and outperforms array-based CNV analysis
  1. Bo Zhou1,2,
  2. Steve S Ho1,2,
  3. Xianglong Zhang1,2,
  4. Reenal Pattni1,2,
  5. Rajini R Haraksingh3,
  6. Alexander E Urban1,2,4
  1. 1 Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, California, USA
  2. 2 Department of Genetics, Stanford University School of Medicine, Stanford, California, USA
  3. 3 Department of Life Sciences, The University of the West Indies, Saint Augustine, Trinidad and Tobago
  4. 4 Program on Genetics of Brain Function, Stanford Center for Genomics and Personalized Medicine, Tasha and John Morgridge Faculty Scholar, Stanford Child Health Research Institute, Stanford University, Stanford, California, USA
  1. Correspondence to Dr Alexander E Urban, Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, Stanford, CA 94305, USA; aeurban{at}stanford.edu

Abstract

Background Copy number variation (CNV) analysis is an integral component of the study of human genomes in both research and clinical settings. Array-based CNV analysis is the current first-tier approach in clinical cytogenetics. Decreasing costs in high-throughput sequencing and cloud computing have opened doors for the development of sequencing-based CNV analysis pipelines with fast turnaround times. We carry out a systematic and quantitative comparative analysis for several low-coverage whole-genome sequencing (WGS) strategies to detect CNV in the human genome.

Methods We compared the CNV detection capabilities of WGS strategies (short insert, 3 kb insert mate pair and 5 kb insert mate pair) each at 1×, 3× and 5× coverages relative to each other and to 17 currently used high-density oligonucleotide arrays. For benchmarking, we used a set of gold standard (GS) CNVs generated for the 1000 Genomes Project CEU subject NA12878.

Results Overall, low-coverage WGS strategies detect drastically more GS CNVs compared with arrays and are accompanied with smaller percentages of CNV calls without validation. Furthermore, we show that WGS (at ≥1× coverage) is able to detect all seven GS deletion CNVs >100 kb in NA12878, whereas only one is detected by most arrays. Lastly, we show that the much larger 15 Mbp Cri du chat deletion can be readily detected with short-insert paired-end WGS at even just 1× coverage.

Conclusions CNV analysis using low-coverage WGS is efficient and outperforms the array-based analysis that is currently used for clinical cytogenetics.

  • read-depth analysis
  • discordant read-pair analysis
  • mate-pair sequencing
  • array Cgh (acgh)
  • copy-number variation (cnv)
View Full Text

Statistics from Altmetric.com

Introduction

A large portion of human genetic diversity is contributed by CNVs.1–5 Many CNVs, typically small deletions or duplications, are common, that is, present at an overall frequency of >1% in the human population.3–6 Large CNVs are relatively rare and are often associated with human disease.7–14 Having technologies available for the reliable and accurate detection and characterisation of CNVs in a given human genome is highly relevant for both clinical diagnostics and basic research. Microarray-based CNV analysis has become a first-tier clinical cytogenetics procedure in patients with unexplained developmental delay/intellectual disability,15 autism spectrum disorder,16 multiple congenital anomalies17 and cancer.13 14

The highest sensitivity and resolution in CNV detection is achieved through deep-coverage, paired-end whole-genome sequencing (WGS).5 18 However, the cost of what is currently the standard for deep-coverage WGS (>30× coverage using short-insert paired-end reads) is still considerably higher than that of arrays; turnaround time is much longer since the samples have to go through an offsite core, and the computational requirements are also very substantial regarding hardware and time. Analysis by deep-coverage WGS methods can not only detect CNVs but also SNPs, short insertions and deletions as well as, with some limitations, sequence variants that are quite challenging to parse out such as inversions and retrotransposition events. However, for clinical cytogenetic applications, such types of variants are for the most part not yet interpretable as to their effects.

With the advent of bench-top high-throughput DNA sequencers, it is now possible to perform low-coverage WGS onsite instead of through a sequencing core facility. To make the most use of this option, that is, to control per sample costs as well as turnaround times, it seems beneficial to use a strategy of lower sequencing coverage (ie, 1×–5× genomic coverage) with sample multiplexing to be cost-effective while carefully weighing the options of short-insert versus long-insert paired-end (i.e. mate-pair) library preparation.

While other recent studies have demonstrated that WGS, including low-coverage WGS, is effective for CNV detection in clinical samples,19–21 systematic, quantitative and direct performance comparisons for CNV analysis between various low-coverage WGS strategies, against deep-coverage WGS and against arrays, are needed to fully assess the feasibility of replacing arrays with low-coverage WGS and to guide researchers in their choices for specific settings. Here, we compared the CNV-detection performances of several low-coverage WGS strategies against each other and also against commercially available arrays. We performed CNV analysis in the genome of the 1000 Genome Project CEU subject NA12878 (probably the best studied genome to date3 22 23) using standard 350 bp short-insert WGS, 3 kb-insert mate-pair WGS and 5 kb-insert mate-pair WGS, each at 1×, 3× and 5× coverages. For benchmarking, we used a gold standard (GS) set of validated CNVs for NA12878 and determined the number of GS CNVs detected by each low-coverage WGS strategy. The GS set contains only high confidence CNVs derived from the 1000 Genomes Project and supported by multiple orthogonal methods.24 25 This approach was also used in Haraksingh et al 25 for the benchmarking of CNV detection in NA12878 from 17 commercially available arrays and thus allows for the performance comparison between low-coverage WGS and the arrays from Haraksingh et al 25 to be conducted in a direct and unbiased manner.

Methods

Sample library construction, sequencing, alignment, CNV analysis, array processing and NCBI accession numbers are described in the online Supplementary materials and methods.

Supplemental material

Results

CNV detection in WGS

From short-insert and mate-pair WGS of NA12878, we performed CNV detection using both read-depth analysis and discordant read-pair analysis (figure 1A). For read-depth analysis, we used CNVnator26 with 5 kb bin size. Discordant read-pair analysis was performed using LUMPY27 with segmental duplications excluded from the analysis. CNVs that overlap problematic regions such as reference gaps, the major histocompatibility complex cluster and Encyclopedia Of DNA Elements (ENCODE) blacklist regions28 were filtered out (see Methods). Afterwards, the union of CNV calls from both analyses was used as the final call set for benchmarking using the GS CNVs as well as comparison with the array calls.25 At 1×, 3× and 5× coverages, short-insert WGS detected 182, 405 and 535 autosomal CNVs, respectively; 3 kb mate-pair WGS detected 452, 689 and 747, respectively and 5 kb mate-pair WGS detected 496, 571 and 725, respectively (online Supplementary table S1, figure 1B).

Supplemental material

Figure 1

Comparisons of CNV calls by whole-genome sequencing (WGS) and arrays. (A) Schematic diagram of detection of CNVs (deletions and duplications) using discordant read-pair analysis and read-depth analysis.1 18 26 Using discordant read-pair analysis, deletions are detected when the distance of alignment to the reference genome between read pairs are closer than the expected insert size of the library, and duplications are detected when the orientation of the aligned read pairs are inversed. Using read-depth analysis, deletions and duplications are detected when there is a pronounced decrease and increase, respectively, in alignments of reads spanning a genomic region relative to the average number of alignments over the genome. (B) Numbers of autosomal CNVs in the genome of subject NA12878 called from short-insert, 3 kb mate-pair and 5 kb mate-pair libraries sequenced at 1×, 3× and 5× coverages compared against previous array calls.25 29 Array-based CNV calls were made according to platform-specific algorithms,25 and WGS CNV calls were made by combining discordant read-pair analysis and read-depth analysis. Gold: autosomal CNVs (overlap ≥50% reciprocally with NA12878 gold standard (GS) CNVs). Green: 10%–50% reciprocal overlap with NA12878 GS CNVs. Blue: <10% reciprocal overlap with GS CNVs, ≥50% reciprocal overlap with NA12878 silver standard CNVs. Red: no overlap (<10% overlap with GS CNVs and <50% overlap with silver standard CNVs). *Benchmarking was performed taking CNV type into account with the exception of Affymetrix SNP 6.029 and Illumina HumanOmni1Quad25 where CNV type information was not available.

CNV-detection performance comparison

We obtained the NA12878 CNV calls by each of 17 currently commercially available high-density oligonucleotide arrays from Haraksingh et al.25 These arrays represent three different technologies: array CGH (aCGH) from Agilent (n=5), SNP genotyping arrays from Illumina (n=10) and aCGH/SNP combination arrays from Affymetrix (n=2). Two technical replicates had been performed for each array hybridisation and CNVs were called using both array platform-specific software and platform-agnostic software Nexus from Biodiscovery except for Affymetrix SNP 6.0 where the platform-specific calls (one replicate available) were obtained from an earlier study.29

We benchmarked the CNV calls from short-insert and mate-pair WGS using the same approach as described in Haraksingh et al,25 where the capabilities of various array platforms were assessed by the numbers of detected CNVs in the NA12878 genome that reciprocally overlap a GS set of NA12878 CNVs. GS CNVs were compiled from 8×-coverage population-scale sequencing (data available on 1000genomes.org) and analysis of 2,504 individual genomes.24 They are of high confidence and supported by multiple lines of evidence that include PCR confirmation, aCGH and discovery from multiple CNV analysis tools. The false-positive rate is estimated to be very low (3.1%).25 The CNVs in this GS set range from 50 bp to 453,312 bp with 1,941 and 135 autosomal deletions and duplications, respectively.25 While most GS duplications are >10 kb, all CNVs <1 kb are deletions. Seven deletions are >100 kb (table 1). A NA12878 silver standard set of CNVs was also used for benchmarking (also obtained from Haraksingh et al 25) which consists of CNVs called using only CNVnator26 from 60×-coverage 2×250 bp short-insert sequencing data from the 1000 Genomes Project. For our analysis, we filtered out silver standard CNVs that overlapped reference gaps30 31 by >50% (online Supplementary figure S1).

Supplemental material

Table 1

Detection of gold standard deletions >100 kb

As was done for array CNV calls,25 the CNV calls from each WGS library at each sequencing coverage were benchmarked by first determining the number of CNVs with boundaries overlapping those of a GS CNV by ≥50% reciprocally and the number of CNVs that overlap of a GS CNV ≥10% reciprocally but <50%. From the CNVs that do not overlap a GS CNV by ≥10% reciprocally, we next determined the number that overlap a silver standard CNV by ≥50% reciprocally (online Supplementary table S1, figure 1B). For array data,25 because more CNV calls and GS-CNV overlaps resulted from the platform-specific CNV analysis overall, we chose to use the array platform-specific calls for comparison. Moreover, because the results in the two technical replicates for each array platform did not show significant differences and only one replicate was available for the Affymetrix SNP 6.0 platform-specific analysis,25 we chose to use CNV calls from the first replicate of each platform. We emphasise that we benchmarked all WGS and array CNV calls by taking the type of CNV (deletion or duplication) into account which was not done in Haraksingh et al.25

With the exception of short-insert WGS at 1× coverage, WGS detects drastically more CNVs and GS CNVs than any of the arrays (figure 1B). CNV calls from WGS are also accompanied by a smaller percentages of calls without validation (ie, CNV calls that are not of high confidence). Validation is hereinafter defined as overlap by a minimum of 10% reciprocally with a GS CNV or >50% reciprocally with a silver standard CNV. At coverages 1×, 3× and 5× and by >50% reciprocal overlap, short-insert WGS detected 48, 164 and 244 GS autosomal CNVs, respectively; 3 kb mate-pair WGS detected 250, 378 and 405 GS autosomal CNVs, respectively and 5 kb mate-pair WGS detected 270, 274 and 335 GS autosomal CNVs, respectively (online Supplementary table S1).

As expected, read-depth analysis resulted in higher rates of CNV detection with increasing coverage but showed similar numbers of CNV calls across different WGS libraries (online Supplementary figure S2A, Supplementary table S2). Discordant read-pair analysis resulted in consistently more CNV calls than read-depth analysis for mate-pair libraries at all coverages and for short-insert library at 5× coverage (online Supplementary figure S2B, online supplementary table S3). Generally, in read-depth analysis, the greater the fraction of uniquely mapped supporting reads, the higher the confidence in the CNV call. Filtering based on this parameter can be done through the q0 value reported by CNVnator26 (online Supplementary table S2). In Abyzov et al,26 the q0 threshold was set as 0.50 indicating that CNVs supported by >50% of reads with a mapping quality of zero were filtered out. To understand how such filtering affects our read-depth analysis, we benchmarked filtered (q0 threshold=0.50) and unfiltered CNV calls (online Supplementary figure S2A,C). We find that the overall number of GS CNVs detected did not noticeably change with the q0 filter. However, the number of silver standards detected dramatically decreases (online Supplementary figure S2C). In addition, the number of non-validated calls decrease less dramatically but very substantially, nonetheless (approximately 12%–30%), consistent with the average decrease in false-discovery rate for samples studied in Abyzov et al.26

Supplemental material

Supplemental material

Supplemental material

Of the arrays, Illumina HumanOmni1 Quad (now discontinued) detected the most GS CNVs (165); however, even at 1× coverage, 3 kb and 5 kb mate-pair WGS detected almost twice as many GS CNVs (275 and 290, respectively) (online Supplementary table S1, figure 1B). While Agilent 2×400K CNV and Illumina HumanOmni5Exome-4v1 detect comparable numbers of CNV to that of short-insert WGS at 3× coverage and to that of mate-pair WGS at 1× coverage, the vast majority of CNVs detected in these two arrays are not validated (figure 1B, online Supplementary figure S3). This is in contrast to low-coverage WGS results where the majority of CNVs detected for all libraries and at all sequencing coverages have validation (figure 1B, online Supplementary figure S3, Supplementary table S1).

Supplemental material

WGS (3 kb mate-pair) at 5× coverage results in the most number of GS CNVs detected (429) and also the lowest percentage without validation (28.5%) (figure 1B, online Supplementary figure S3, Supplementary table S1). The percentages of CNV calls without validation range from 28.5% to 37.1% for mate-pair WGS and 33.6%–48.4% for short-insert WGS (online Supplementary figure S4, Supplementary table S1). These percentages for WGS are smaller than those for most arrays except Illumina HumanOmniExpress and HumanOmni25 arrays (online Supplementary figure S3). These two arrays, however, made a very low number of calls (15 and 31, respectively) compared with WGS (figure 1B, online Supplementary table S1). Affymetrix SNP 6.0 also has a higher overall validation rate than that of WGS, but information for whether a CNV call is a deletion or duplication is not available for this array data set.25 29 It is uncertain how its validation rate will change if this information can be taken into account (as for the WGS and other array data sets).

Supplemental material

When deletions and duplications are analysed separately, the validation rates for deletions are higher than for duplications (56.1%–75.6% vs 34.8%–49.4%) in WGS, whereas arrays show a much wider variability (online Supplementary figure S4, online Supplementary table S1). For deletions, the Agilent CGH arrays, Affymetrix CytoScanHD and the Illumina CytoSNP, HumanCoreExome, HumanOmniExpressExome-8v1, and Psych arrays have validation rates between 15% and 50%. The Illumina HumanOmni arrays (except HumanOmniExpressExome-8v1) have validation rates between 54.9% and 83.3%, but none of these arrays detected more than 70 deletions. For duplications, the Agilent CGH, HumanOmni5Exome, and Affymetrix CytoScanHD have validation rates between <1% and 20%. All other arrays detected no more than seven duplications. The Illumina CytoSNP, HumanCoreExome, HumanOmni25-8v1 and Psych arrays detected 4, 1, 5, and 2 duplications, respectively, and all were validated (online Supplementary table S1). The HumanOmni arrays (except HumanOmni5Exome) have validation rates between 50.0% and 85.7% detecting duplications ranging from 2 to 7. For both deletions and duplications, the numbers of GS and silver standard CNVs detected and the overall validation rates are much higher for WGS compared with the arrays (online Supplementary figure S4).

Sensitivity of GS CNV detection

Overall, compared with arrays, low-coverage WGS is able to detect on average approximately fivefold more GS CNVs (>1 kb) by ≥50% reciprocal overlap (figure 2A). For example, the most sensitive array (as measured by the number of GS CNVs detected), Illumina HumanOmni1 Quad (now discontinued), detects less than one-third as many GS CNVs than the most sensitive WGS method (3 kb mate-pair at 5× coverage). The second least sensitive WGS method (short-insert at 3× coverage) is still more sensitive than Illumina HumanOmni1 Quad. Even the least-sensitive WGS method (short-insert at 1× coverage) is more sensitive than 14 out of the 17 arrays. Moreover, mate-pair WGS is able to detect >50% of GS CNVs in the 5 kb–10 kb size range (figure 2B-D).

Figure 2

Sensitivity of whole-genome sequencing (WGS) detection of NA12878 GS CNVs (>1 kb). (A) Sensitivity of (>1 kb) GS CNV detection across WGS libraries and array platforms as determined by the ratio of detected autosomal GS CNVs to total number of autosomal GS CNVS. Array-based CNV calls were made according to platform-specific algorithms, and WGS CNV calls were made by combining discordant read-pair analysis and read-depth analysis. Green: percentage of total autosomal GS CNVs detected (overlapping by >50% reciprocally). Light Blue: percentage of total autosomal GS CNVs not detected (non-overlapping by >50% reciprocally). Sensitivity of GS CNV detection in different size ranges from (B) short-insert, (C) 3 kb mate-pair and (D) 5 kb mate-pair libraries at sequencing coverages 1×, 3× and 5×. CNVs were called by combining discordant read-pair analysis and read-depth analysis.

While CNV detection increases with additional coverage for all WGS libraries, the increase is non-linear. The highest increases are from 1× to 3× coverage for short-insert and 3kb-mate-pair WGS (figure 1B). While more CNVs are consistently detected from mate-pair WGS compared with short-insert WGS, interestingly, more total CNVs and GS CNVs are detected from 3 kb mate-pair WGS than from 5 kb mate-pair WGS at 3× and 5× coverages, respectively (figure 1B, figure 2A). In addition, while additional coverage is associated with overall increases in the detection of GS CNVs, this increase is less obvious as CNV sizes increase to >50 kb (figure 2B-D). In short-insert WGS, with additional coverage, the most drastic gains in the GS CNVs detected are between 5 kb to 50 kb (figure 2B). This is similar for mate-pair WGS though less pronounced (figure 2C, D).

Size distribution of CNV calls

The sizes of NA12878 CNVs detected from short-insert and mate-pair WGS range from 100 bp to 500 kb and 1 kb to 500 kb, respectively (figure 3A–C, online Supplementary tables S2–S4). Read-depth analysis and discordant read-pair analysis detect CNVs in different size ranges (figure 3D–I, online Supplementary figure S5). Overall, WGS detects CNVs in a wider size distribution compared with arrays (online Supplementary figure S5). Since bin size was set to 5 kb (see the Methods section), all resulting CNVs detected are ≥5 kb from read-depth analysis (figure 3D–F). As expected, the size distributions of CNVs called by read-depth analysis are very similar (figure 3D–F) for the various WGS libraries, whereas discordant read-pair analysis shows more variability (figure 3G–I) reflecting the different insert sizes. A greater proportion of CNVs <5 kb are detected in 3 kb mate-pair compared with 5 kb mate-pair WGS (figure 3B,C). This is likely because for longer insert sizes, the experimental variability in size selection during library preparation makes for increased uncertainty in calling smaller CNVs. At the same time, a longer insert size has a greater ability to span CNV boundaries, yielding higher physical coverage, thus increasing overall detection power, especially for CNVs in the medium to large size range. Furthermore, the increases in total CNV detection as a result of increasing coverage are mainly for CNVs <50 kb in both read-depth analysis and discordant read-pair analyses (figure 3D–I).

Supplemental material

Figure 3

Size distributions of NA12878 detected by whole-genome sequencing (WGS). Size distribution of NA12878 CNVs detected from (A) short-insert, (B) 3 kb mate-pair and (C) 5 kb mate-pair libraries called by combining discordant read-pair analysis and read-depth analysis. CNVs called from (D) short-insert (E) 3 kb mate-pair and (F) 5 kb mate-pair libraries by read-depth analysis only. CNVs called from (G) short-insert, (H) 3 kb mate-pair and (I) 5 kb mate-pair libraries by discordant read-pair analysis only.

All seven GS deletions >100 kb are detected by WGS ≥1× coverage

All seven GS deletions >100 kb are detected with all WGS libraries ≥1× coverage. Six deletions are detected with >50% reciprocal overlap, and one deletion (chr6: 78,892,808–79,053,430) is detected with a mean overlap=44% (online Supplementary table S5). Previously, it had been shown that only one of these deletions (chr19:20,595,835–20,717,950) is detected by most arrays25 (table 1). Affymetrix SNP 6.0 and Affymetrix CytoScanHD perform the best out of the 17 arrays detecting six and five of these deletions, respectively (table 1). Four out of these seven deletions are detected in Agilent 2×400 CGH but as high copy duplications. The GS deletion on chromosome 3 (chr3: 162,514,471–162,625,647) is only detected in the Agilent CGH arrays but consistently miscalled as a duplication (Supplementary Discussion).

Supplemental material

Detection of the 15 Mbp Cri du chat deletion by read-depth analysis

As a vignette with immediate clinical relevance, we also demonstrate that a much larger CNV, the 15 Mbp Cri du chat deletion spanning from 5p15.31 to 5p14.2 in NA16595 (sample from Coriell Institute), can be readily detected in a short-insert WGS library at coverages of 1×, 3× and 5× using read-depth analysis only (online Supplementary table S6). It can also be easily visualised at all coverages by the substantial drop in read-depth coverage in Integrative Genomics Viewer32 (figure 4). This NA16595 Cri du chat deletion is also confirmed using the Illumina Multi-Ethnic Genotyping Array with two technical replicates (online Supplementary table S6).

Supplemental material

Figure 4

NA16595 Cri du chat deletion. Integrative Genomics Viewer32 screenshot of the 15 Mbp Cri du chat deletion on chromosome 5 in NA16595 by short-insert whole-genome sequencing (WGS) at 1×, 3× and 5× coverages. Vertical axis: coverage value, blue dots: respective coverage at genomic positions. Two areas within the deletion show unusually high coverage due to overlap with segmental duplications resulting in cross-mapping of sequencing reads.

Discussion

Using the CNVs in NA12878 as a benchmark, we systematically compared the CNV detection performances of low-coverage WGS strategies relative to each other and relative to various arrays currently in routine use for cytogenetics. CNVs were called using standard methods for both low-coverage WGS and arrays and then compared with a list of NA12878 GS CNVs that had been distilled from the 1000 Genomes Project24 as well as to a set of silver standard CNVs generated from 1000 Genomes Project 60×-coverage WGS data (2×250 bp, short insert).25 The silver standard CNVs were called using CNVnator26 which is also used in this study for read-depth analysis of low-coverage WGS data. This further increases the direct comparability of CNV-calling efficiency across the range of coverages though a certain bias in favour of WGS is therefore present in the Silver Standard-based parts of the performance comparison. In almost all scenarios, the WGS approaches show considerably higher sensitivities at detecting GS CNVs than even the best-performing arrays (figure 2A) and are furthermore accompanied by lower percentages of total CNV calls without validation (online Supplementary figure S3, figure 1B). While all methods of CNV detection left >80% of total GS CNVs undetected, this can be largely explained by that 63% of GS CNVs are <1 kb and 54% are <500 bp which are outside the sensitive detection ranges for all methods (figure 2B–D, online Supplementary figure S5).

Twenty autosomal GS CNVs were detected collectively in the 17 arrays but not by low-coverage WGS, whereas 426 were detected by low-coverage WGS but not by any of the arrays (online Supplementary table S7). Of these 20 GS CNVs, four are approximately 1 kb or shorter, where the sensitivity of detection is low (figure 2B–D); 13 (65%) are in regions excluded from WGS analysis. These regions include segmental duplications, the MHC cluster, regions that are different between hg19 and hg38 (likely misassembled regions in hg19) and regions in the ENCODE blacklist (regions that often produce artificially high coverage due to excessive unstructured anomalous mapping).28 The remaining three CNVs do not fall into any of these categories. It is unclear why these three CNVs were not detected with WGS, but by visual inspection, their boundaries lie within repetitive elements, that is, LINE1, SINE, segmental duplications. One of these three CNVs (chr1: 248756741–248797597) was called as a larger deletion (chr1: 248692001–248820000) in WGS. It is likely that the size of this deletion call was extended due to noisy coverage signal from its flanking segmental duplication regions.

Supplemental material

One may be tempted to conclude that in certain cases, array-based techniques are superior for CNV detection since there are indeed 20 GS CNVs that were not detected in WGS. However, it is also important to note that these 20 CNVs were detected by combining 17 arrays and that these CNVs were elusive to detection from low-coverage WGS largely due to small size and to occurring in problematic regions excluded from analysis. However, these genomic features are also problematic for arrays in most cases. Our analysis shows that no single array platform or design is specifically sensitive for detecting CNVs that are associated with these features. Therefore, while there are a few specific cases in which CNVs are detected in arrays and not in WGS, we do not see a scenario for which one can make a general statement that array-based techniques are superior for detecting CNVs associated with particular genomic characteristics. In any case, our results show that it is >20 times more likely that a CNV is detected in low-coverage WGS and not in any arrays (online Supplementary table S7)

Although CNV detection methods from WGS data have been available for up to a decade,1 26 33 34 cost, long turnaround times and heavy computational requirements for deep-coverage WGS analysis have been major obstacles that prevented the adoption of WGS-based methods for cytogenetic applications. Our comparative analysis here shows that these obstacles can be overcome by adopting low-coverage WGS strategies. This means that a cytogenetics laboratory can now avail itself of a technology with a CNV detection and resolving power that compares very favourably with existing standard methodologies such as arrays or karyotyping while not having to accept an increased burden in terms of cost per sample or turnaround time. Our WGS libraries were sequenced on the Illumina NextSeq 500, in multiplexed fashion, at a cost in sequencing consumables of approximately US$150 per 1× coverage. The short-insert libraries were prepared using the Kapa Hyper Prep kit (Kapa Biosystems) where the cost per library is approximately US$40–US$50 with >50 ng of genomic DNA needed as input for a high-complexity library. The mate-pair library construction reagents cost ~US$300 per library using the Illumina Mate Pair Library Prep kit requiring 2–4 µg of high molecular weight genomic DNA (mean size >20 kb). The mate-pair library construction costs can decrease further to approximately US$50–US$80 per library if mechanical DNA shearing is employed instead of enzymatic shearing.35 The costs for arrays are more variable (<US$100 for the Illumina PsychArray to several hundred dollars for higher density arrays). Preparation time for arrays (labelling, hybridisation, washing and scanning) and low-coverage short-insert sequencing (library construction, quantification and loading onto sequencer) both take approximately 2 days; mate-pair libraries require an additional day. The analyses can all be performed on a standard desktop computer.

The amount of input genomic DNA required for mate-pair WGS is approximately twofold more than for most arrays. However, as long as this amount of DNA is available, it could be reasoned that it is preferable to use mate-pair WGS for CNV analysis instead of arrays, considering, for example, that mate-pair WGS even at just 1× coverage is much more sensitive at detecting CNVs than all currently used arrays. For samples with limited DNA, short-insert WGS at just 3× coverage (at a cost of circa US$350 per sample using bench-top instruments such as the Illumina NextSeq 500) is still as effective (if not more so in terms of cost and effort) as arrays while easily outperforming arrays in the ability to detect and resolve CNVs. It should also be taken into consideration that sequencing costs will only continue to decrease, thus rendering the use of WGS for CNV detection even more cost-effective in the foreseeable future.

The choice of algorithm used for CNV analysis is likely to greatly impact the number and accuracy of CNV calls.34 36 A comprehensive comparative analysis of these algorithms is beyond the scope of this study, though work on this matter has been discussed extensively in recent publications.33 37–39 For in-depth discussions of CNV analysis tools, approaches, parameters and challenges as well as performance comparisons with 30× coverage WGS (online Supplementary figure S6, online Supplementary table S1), see online supplementary discussion.

Supplemental material

Overall, low-coverage WGS approaches are drastically more sensitive at detecting CNVs compared with the best-performing arrays (currently commercially available) and are accompanied with smaller percentages of calls without validation. The prospect of replacing arrays with low-coverage WGS in a cytogenetic context seems promising and essentially at hand. Our results will contribute to the discussion on when and via which route this transition from using arrays to WGS will be plausible in cytogenetics practice.

Supplemental material

Supplemental material

Supplemental material

Acknowledgments

We thank Dr Billy T Lau at the Stanford Genome Technology Center for performing the pilot round of library sequencing on the Illumina NextSeq 500.

References

View Abstract

Footnotes

  • Contributors BZ and AEU conceived and designed the study. BZ and RP performed the experiments. BZ designed the analysis pipeline. BZ, SSH and XZ performed the analysis. RRH contributed code. BZ, SSH and AEU wrote the manuscript.

  • Funding This work was supported by the Stanford Medicine Faculty Innovation Program and from the National Institutes of Health NHGRI grant P50 HG007735. BZ was additionally funded by NIH Grant T32 HL110952.

  • Competing interests None declared.

  • Patient consent Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Correction notice This article has been corrected since it was published Online First. RRH’s affiliation has been corrected.

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.