  • Technical Report

Ancestry estimation and control of population stratification for sequence-based association studies

Abstract

Estimating individual ancestry is important in genetic association studies where population structure leads to false positive signals, although assigning ancestry remains challenging with targeted sequence data. We propose a new method for the accurate estimation of individual genetic ancestry, based on direct analysis of off-target sequence reads, and implement our method in the publicly available LASER software. We validate the method using simulated and empirical data and show that the method can accurately infer worldwide continental ancestry when used with sequencing data sets with whole-genome shotgun coverage as low as 0.001×. For estimates of fine-scale ancestry within Europe, the method performs well with coverage of 0.1×. On an even finer scale, the method improves discrimination between exome-sequenced study participants originating from different provinces within Finland. Finally, we show that our method can be used to improve case-control matching in genetic association studies and to reduce the risk of spurious findings due to population structure.
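To give a rough sense of how sparse the data are at the coverages quoted above, the sketch below estimates how many reference loci would be touched by at least one read. It assumes that read depth at each locus is Poisson-distributed with mean equal to the coverage, which is an illustrative simplification and not a statement from the article. The function name is hypothetical, and the locus count is the 632,958 HGDP loci given in the caption of Supplementary Figure 1.

```python
import math

def expected_covered_loci(n_loci: int, mean_coverage: float) -> float:
    """Expected number of loci with at least one read, assuming Poisson depth."""
    # P(depth >= 1) = 1 - P(depth == 0) = 1 - exp(-mean_coverage)
    return n_loci * (1.0 - math.exp(-mean_coverage))

HGDP_LOCI = 632_958  # from the caption of Supplementary Figure 1

for coverage in (0.001, 0.1):
    n = expected_covered_loci(HGDP_LOCI, coverage)
    print(f"{coverage}x: about {n:.0f} loci with at least one read")
```

Under this assumption, a coverage of 0.001× leaves only several hundred loci with any reads at all.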

Figure 1: Graphic illustration of the LASER method.
Figure 2: Estimation of worldwide continental ancestry.
Figure 3: Estimation of ancestry within Europe.
Figure 4: Estimation of fine-scale ancestry within Finland.

Acknowledgements

We thank investigators from the FUSION study and the GoT2D Sequencing Project for generously sharing whole-genome and deep exome sequence data for 941 individuals before publication and the D2D, Finrisk 2002, Health 2000, Action LADA and Savitaipale studies for providing some of the FUSION-sequenced DNA. We thank J.Z. Li for his assistance with the HGDP data set, H. Stringham and A. Locke for assistance with the FUSION data set and M. Brooks for organizing the macular degeneration samples. C.W. acknowledges funding support from a Howard Hughes Medical Institute International Student Research Fellowship. This study is supported by the US National Institutes of Health (DK062370, HG000376, HG005552, HG006513, EY022005, HG007022, HG005855, HG003079, CA076404 and CA134294) and by the National Eye Institute Intramural Research Program.

Author information

Contributions

C.W., X.Z., S.Z. and G.R.A. conceived and implemented the approach. X.L. provided critical feedback on methodology and simulations. J.B.-G., D.S., E.Y.C., K.E.B., J.H., R.F., R.K.W., E.R.M. and A.S. contributed the macular degeneration targeted sequencing data. H.M.K. and FUSION collaborators contributed the Finnish exome sequence data. C.W. and G.R.A. wrote the first draft of the manuscript. All authors reviewed, revised and contributed critical feedback to the manuscript and presentation.

Corresponding authors

Correspondence to Chaolong Wang or Gonçalo R Abecasis.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Additional information

Full lists of members and affiliations appear in the Supplementary Note.

Integrated supplementary information

Supplementary Figure 1 Off-target coverage for 410 samples from the 1000 Genomes exon project.

The off-target coverage for each sample is calculated by averaging across the 632,958 loci in HGDP. For the 270 loci that fall within the targeted regions, we set the coverage to 0 for all samples. Mean off-target coverage is 0.096× across the HGDP loci.
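The averaging described in this caption can be sketched as follows. This is a minimal illustration, assuming per-locus read depths and an on-target mask are already available as arrays; the function name and the toy values are hypothetical and are not taken from the LASER software.

```python
import numpy as np

def mean_off_target_coverage(depth, on_target):
    """Average depth across all reference loci, counting on-target loci as 0."""
    depth = np.asarray(depth, dtype=float)
    on_target = np.asarray(on_target, dtype=bool)
    # Zero out loci inside targeted regions, then average over ALL loci,
    # so the masked loci still count in the denominator.
    masked = np.where(on_target, 0.0, depth)
    return float(masked.mean())

# Toy example: four loci, the last one lies inside a targeted region.
depth = [1, 0, 2, 40]
on_target = [False, False, False, True]
print(mean_off_target_coverage(depth, on_target))
```

Keeping the masked loci in the denominator follows the caption's wording, in which the coverage at those loci is set to 0 instead of the loci being dropped.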

Supplementary Figure 2 Estimation of worldwide ancestry for 410 samples in the 1000 Genomes exon project.

The SNP genotypes of these samples are from the HapMap Project. We used all HGDP individuals as the reference panel, as labeled by colored points. (A,B) Results based on SNPs that were genotyped in both HapMap 3 and HGDP. (C,D) Results based on off-target sequence data. Procrustes similarity to the SNP-based coordinates is t0 = 0.9955; r² = 0.9950, 0.9871, 0.9439 and 0.7747 for PC1, PC2, PC3 and PC4, respectively.
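A Procrustes similarity of this kind can be sketched in a few lines. The example below assumes that t0 is defined as sqrt(1 − D), where D is the minimized sum of squared distances between two centered, unit-norm coordinate sets after an optimal rotation (or reflection) and scaling; the caption does not spell out the definition, so treat this as an assumption. The function name is hypothetical.

```python
import numpy as np

def procrustes_similarity(x, y):
    """Procrustes similarity t0 between two (n_samples, n_dims) coordinate sets."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Remove translation, then scale each set to unit Frobenius norm.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # For unit-norm inputs the minimized distance is D = 1 - (sum of singular
    # values of x^T y)^2, so t0 = sqrt(1 - D) is the sum of singular values.
    singular_values = np.linalg.svd(x.T @ y, compute_uv=False)
    return float(min(1.0, singular_values.sum()))

rng = np.random.default_rng(0)
coords = rng.normal(size=(50, 2))
theta = 0.7
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
# A rotated, scaled and shifted copy should be almost perfectly similar.
moved = 3.0 * coords @ rotation + np.array([5.0, -2.0])
print(procrustes_similarity(coords, moved))
```

Because the measure ignores translation, rotation and overall scale, it compares the relative placement of samples and not their raw coordinates.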

Supplementary Figure 3 Off-target coverage for 3,159 samples from the AMD study.

The red line indicates off-target coverage averaged across the 632,958 loci included in HGDP. The blue line indicates off-target coverage averaged across the 318,682 loci included in POPRES. For loci that fall within the targeted regions (215 loci in HGDP and 113 loci in POPRES), we set the coverage to 0 for all samples. Mean off-target coverage is 0.224× across the HGDP loci and 0.241× across the POPRES loci.

Supplementary Figure 4 Estimation of ancestry for 3,159 samples in the AMD targeted sequencing data set.

(A,B) Results based on the HGDP reference panel, whose colors and symbols follow Supplementary Figure 2. AMD samples are displayed in black, with different symbols representing possible ancestries based on their estimated PC coordinates. Two HapMap trios are labeled in gray. (C,D) Results based on the POPRES reference panel. Panel C displays PC1 and PC2 of POPRES; panel D displays 3,072 AMD samples on top of the POPRES samples. These samples are possibly of European or Middle Eastern ancestry, as indicated in panels A and B. Population labels for the POPRES samples are as follows: AL, Albania; AT, Austria; BA, Bosnia and Herzegovina; BE, Belgium; BG, Bulgaria; CH-F, Swiss French; CH-G, Swiss German; CH-I, Swiss Italian; CY, Cyprus; CZ, Czech Republic; DE, Germany; DK, Denmark; ES, Spain; FI, Finland; FR, France; GB, United Kingdom; GR, Greece; HR, Croatia; HU, Hungary; IE, Ireland; IT, Italy; KS, Kosovo; LV, Latvia; MK, Macedonia; NL, Netherlands; NO, Norway; PL, Poland; PT, Portugal; RO, Romania; RU, Russia; Sct, Scotland; SE, Sweden; SI, Slovenia; SK, Slovakia; TR, Turkey; UA, Ukraine; YG, Serbia and Montenegro.

Supplementary Figure 5 Sequence-based coordinates and SNP-based coordinates for 931 AMD samples when using the HGDP reference panel.

Colors and symbols for HGDP and AMD samples follow Supplementary Figure 2. (A,B) Results based on 45,700 SNPs that are shared by the HGDP, POPRES and AMD SNP data sets. (C,D) Results based on off-target sequence data. The Procrustes similarity between results in panels A and B and in panels C and D is t0 = 0.9068; r² = 0.9104, 0.8881, 0.6031 and 0.1828 for PC1, PC2, PC3 and PC4, respectively.

Supplementary Figure 6 Sequence-based coordinates and SNP-based coordinates for AMD samples when using the POPRES reference panel.

We included only the 928 AMD samples whose genotype data are available and who might be European or Middle Eastern according to the results in Supplementary Figure 5. (A) Results based on 45,700 SNPs that are shared by the HGDP, POPRES and AMD SNP data sets. (B) Results based on off-target sequence data. The Procrustes similarity between results in panels A and B is t0 = 0.9209; r² = 0.9557 and 0.6389 for PC1 and PC2, respectively.

Supplementary Figure 7 Results for simulated exome sequencing data for 385 POPRES samples.

(A) Coordinates estimated from SNP genotypes at 2,547 on-target loci. Procrustes similarity to the SNP-based coordinates in Figure 3A is t0 = 0.5031. (B) Coordinates estimated based on off-target sequence reads (t0 = 0.9467). (C) Coordinates estimated based on sequence reads from both off-target and on-target regions (t0 = 0.9669). Mean coverage is ~88.9× and ~1.0× for on-target and off-target regions, respectively.

Supplementary Figure 8 Different strategies for sampling 1,280 cases.

(A) Sampling from two 8 × 8 grids along one side, with ten cases from each grid point. (B) Sampling from two 8 × 8 grids along the diagonal, with ten cases from each grid point. (C) Sampling from one 8 × 8 grid at the corner, with 20 cases from each grid point. (D) Sampling from one 8 × 8 grid at the center, with 20 cases from each grid point.
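The case counts in these four designs can be checked with a short sketch. The grid origins below are hypothetical placeholders, since the caption does not give the lattice coordinates; only the 8 × 8 grid size and the number of cases per grid point come from the caption.

```python
from itertools import product

def sample_cases(grid_origins, cases_per_point, side=8):
    """Return one (row, col) entry per sampled case for the given side x side grids."""
    cases = []
    for row0, col0 in grid_origins:
        for d_row, d_col in product(range(side), repeat=2):
            # Every grid point contributes the same number of cases.
            cases.extend([(row0 + d_row, col0 + d_col)] * cases_per_point)
    return cases

# Designs A and B: two grids, ten cases per grid point (origins are assumed).
two_grids = sample_cases([(0, 0), (0, 8)], cases_per_point=10)
# Designs C and D: one grid, 20 cases per grid point (origin is assumed).
one_grid = sample_cases([(0, 0)], cases_per_point=20)
print(len(two_grids), len(one_grid))
```

Both layouts give 2 × 64 × 10 = 1 × 64 × 20 = 1,280 cases, so the designs differ only in where the cases are drawn from.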

Supplementary Figure 9 Improvement of estimation by using coordinates averaged across multiple runs of LASER on the same data set.

The x axis indicates the number of runs used in calculating mean PC coordinates. The y axis indicates Procrustes similarity t0 between the mean coordinates and the SNP-based coordinates. Each box represents the distribution of t0 obtained from 15 repeated runs. (A) Results on sequence data of worldwide samples simulated from the genotypes of 238 HGDP individuals, using the other 700 HGDP individuals as the reference panel. We tested on three simulated data sets with coverage of 0.001×, 0.002× and 0.004×. (B) Results on sequence data of European samples simulated from the genotypes of 385 POPRES individuals, using the other 1,000 POPRES individuals as the reference panel. We tested on three simulated data sets with coverage of 0.10×, 0.20× and 0.40×. We only used one iteration in our examples of the 1000 Genomes and AMD targeted sequencing data because most samples have relatively high off-target coverage, such that improvement by using multiple iterations is small.
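The idea of averaging coordinates across runs can be illustrated with simulated numbers. The sketch below assumes that every run reports coordinates in the same reference space and that the run-to-run error behaves like independent noise; both are assumptions made for illustration, and none of the values are taken from the article.

```python
import numpy as np

def mean_coordinates(runs):
    """Average PC coordinates over runs; `runs` has shape (n_runs, n_samples, n_pcs)."""
    return np.asarray(runs, dtype=float).mean(axis=0)

rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 2))                          # stand-in for SNP-based coordinates
runs = truth + rng.normal(scale=0.5, size=(10, 200, 2))    # ten noisy runs (simulated)

def rmse(estimate):
    """Root-mean-square error against the simulated truth."""
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))

print("single run:", rmse(runs[0]))
print("mean of 10 runs:", rmse(mean_coordinates(runs)))
```

With independent noise, the error of the mean shrinks as more runs are averaged, which matches the direction of the improvement described in the caption.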

Supplementary Figure 10 Data processing procedures for the HGDP and POPRES data sets.

(A) The HGDP data set. (B) The POPRES data set.

Supplementary Figure 11 Data processing procedures for the HapMap 3 and AMD SNP data sets.

(A) The HapMap 3 data set. (B) The AMD SNP data set.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–11, Supplementary Tables 1–9 and Supplementary Note (PDF 3687 kb)

About this article

Cite this article

Wang, C., Zhan, X., Bragg-Gresham, J. et al. Ancestry estimation and control of population stratification for sequence-based association studies. Nat Genet 46, 409–415 (2014). https://doi.org/10.1038/ng.2924

  • DOI: https://doi.org/10.1038/ng.2924
