Introduction

As a consequence of the generation of large numbers of genotypes in both family- and population-based genetic studies, much work is currently focusing on the identification and possible integration of error within such data. Error can be introduced into genetic data sets from a variety of sources, which include inconsistencies within family pedigrees,1,2 sample mishandling,3 and errors introduced by the genotyping process itself.4

Inclusion of incorrect data in genetic analysis can lead to the generation of false conclusions5,6 and a reduction of power to fine map trait loci,7,8 and is a recognised problem in the statistical analysis of genetic data sets.9,10,11,12,13 Previous simulation studies have considered the impact of genotyping errors in data sets generated from pedigrees;5,10 even genotyping error rates as low as 1–2% can affect both linkage and sib-pair studies.5 Within family studies, incorrect genotypes may inflate map distance between markers and also reduce the power to detect linkage,7,14,15,16 and contribute to an inflated false positive rate among transmission disequilibrium test (TDT)-derived associations.17

Within family studies, a proportion of genotyping errors can be detected by incorporating checks for consistency with Mendelian inheritance.2,9,18 Checking for Mendelian inconsistency will not, however, exclude all genotyping errors, and cannot be applied to population-based studies. Errors within genotyping data sets generated from unrelated individuals therefore have considerable impact on subsequent data analysis, as no checks for Mendelian consistency are possible within such data sets.19 It has also been demonstrated in a simulation study that genotyping error rates as low as 3% can adversely affect linkage disequilibrium (LD) measures.20 This could limit attempts to identify complex disease genes, because genotyping errors have been shown always to decrease the power of certain statistical tests for linkage and/or association; for example, the χ2 test of independence applied to case:control data always loses power in the presence of genotyping errors.8,21,22

Statistical tools that are able to take error into account have been developed. The majority of these models are applicable to linkage studies23,24,25,26 or TDTs.17,27 Genotyping error within data sets generated from unrelated individuals is also currently being addressed within certain statistical models.19,28

This study used the Hardy–Weinberg equilibrium (HWE) test to identify genotyping error within population-based data sets. In large enough randomly mating populations, not subject to genetic and population parameters affecting allele frequencies, the genotypes for an individual marker should distribute according to the principle of HWE.29 Technical reasons, such as assay nonspecificity and genotyping errors, can also impact on the distribution of genotypes for any one marker. These technical reasons for distribution deviation were explored in this study. As a result, an improved genotyping process was implemented to reduce the occurrence of genotype errors.
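As an illustration of the principle, the following minimal sketch computes the genotype proportions expected under HWE for a biallelic marker and shows how miscalling some heterozygotes shifts the observed distribution. The function names, the allele frequency of 0.7 and the 5% miscall rate are assumptions chosen for illustration; they are not values or methods from this study.

```python
def hwe_proportions(p):
    """Expected genotype proportions (AA, AB, BB) under HWE for allele A frequency p."""
    q = 1.0 - p
    return p * p, 2.0 * p * q, q * q


def miscall_heterozygotes(proportions, rate):
    """Move a fraction `rate` of heterozygotes into the AA class.

    This is a hypothetical error model used only to illustrate the effect.
    """
    aa, ab, bb = proportions
    moved = ab * rate
    return aa + moved, ab - moved, bb


# Hypothetical marker with major allele frequency 0.7
true_props = hwe_proportions(0.7)
# Same marker after 5% of heterozygotes are miscalled as AA
observed_props = miscall_heterozygotes(true_props, 0.05)
```

Under this error model the heterozygote class is depleted relative to HWE expectation, which is the kind of distortion a test for HWE can detect when the sample size is large enough.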

Subjects and methods

DNA samples

A total of 2750 Caucasian samples have been employed in this analysis. A total of 2008 of these samples have already been reported elsewhere.30,31,32 In addition, 588 samples were collated from North European Caucasians within GlaxoSmithKline with consent for nonidentified genotyping. In all, 92 Caucasian DNA samples were collected from North America with informed consent for nonidentified genotyping, and 62 Caucasian DNA samples were purchased from Coriell cell repositories (Camden, New Jersey, USA) (Table 1).

Table 1 Number of SNPs and samples used for each of the four technologies

Genotyping

A total of 443 single-nucleotide polymorphisms (SNPs) were typed using different sets of the DNA samples and different methodologies (Table 1) generating 107 068 genotypes.

Determination of deviation of genotype distribution from HWE

Minor allele frequencies were recorded for all of the 443 SNPs (Table 2). A subset of 313 SNPs, whose minor allele frequencies were >5%, was analysed for deviation of genotype distribution from HWE, using the χ2 statistic with one degree of freedom. SNPs whose genotype distribution deviated from HWE (P<0.05) were identified (Table 3), and are also referred to as HWD SNPs.
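The test described above can be sketched as follows for a single biallelic SNP. This is a minimal illustration, not the software used in the study: the function names are hypothetical, no continuity correction is applied (the text does not say whether one was used), and the P value is obtained from the identity that the upper tail of a χ2 distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from the three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1.0 - p)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (chi2, p_value) for a 1-df goodness-of-fit test against HWE."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        raise ValueError("marker is monomorphic; the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical genotype counts, not data from the study
chi2_ok, p_ok = hwe_chi_square(49, 42, 9)     # counts match HWE for p = 0.7
chi2_dev, p_dev = hwe_chi_square(60, 20, 20)  # marked heterozygote deficit
```

A SNP would be flagged as HWD in the sense used here when its minor allele frequency exceeds 0.05 and the returned P value is below 0.05.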

Table 2 Distribution of minor allele frequencies of 443 SNPs
Table 3 SNPs (36) exhibiting HWD (P<0.05)

Assay specificity

Nucleotide homology searches were performed on the primer and probe sequences defining reaction specificity for each of the SNPs deviating from HWE (P<0.05), using the NRNUC and NRHTG (Human Genome from EMBL, Sanger Centre and Washington University) databases in March 2002.

Results

In all, 107 068 genotypes were generated from 443 SNPs, with minor allele frequencies ranging between 0.002 and 0.49 (Table 2). Of the 443 SNPs, 80% (353/443) were distributed throughout the genome. Of the remaining 90 SNPs, 38 mapped to a 390 kb region on chromosome 22,32 24 to a region on chromosome 19,30 and 28 to a region on chromosome 3.31 A subset of 313 SNPs, whose minor allele frequencies were >5%, was selected for the estimation of deviation from HWE. A total of 36 SNPs (11.5%), with minor allele frequencies ranging between 0.06 and 0.49, were found to deviate from HWE (P<0.05) (Table 3). Of the 36 SNPs demonstrating HWD, 20 displayed deviation from HWE at the P<0.01 level (Table 3). Controlling the false discovery rate,33 16 of them would be considered significant at the 5% level.
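The text does not state which false discovery rate procedure was applied. The sketch below assumes the widely used Benjamini-Hochberg step-up procedure, purely as an illustration of how such a threshold could be applied to a list of per-SNP P values; the function name and the example P values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans marking which P values are significant at FDR level alpha.

    Assumes the Benjamini-Hochberg step-up procedure: find the largest
    rank k such that p_(k) <= (k / m) * alpha, then reject ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical P values, not data from the study
flags = benjamini_hochberg([0.041, 0.001, 0.205, 0.008], alpha=0.05)
```

Because the procedure is step-up, a P value that misses its own rank threshold is still declared significant if a larger P value meets the threshold for its rank.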

Possible explanations for SNPs that showed deviation from HWE were explored. An SNP assay was classified as ‘nonspecific’ if a primer and/or probe set showed 100% homology with multiple regions in the genome. Five of the 36 SNPs (14%) were found to have ‘nonspecific’ assays. Genotyping errors were identified in 21 assays, accounting for 58% of the SNPs showing deviation from HWE. For the remaining 10 SNPs (28%), no reasons for the observed HWD were identified. Among assays that deviated from HWE at the P<0.01 level (Table 4), the percentage of SNPs associated with genotyping errors was slightly lower, and the proportion associated with nonspecificity slightly higher, than among all HWD (P<0.05) assays.
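The ‘nonspecific’ classification can be illustrated with a toy check that counts exact matches of a primer or probe, on either strand, across a set of candidate regions. This is only a sketch of the classification rule: the study used homology searches against sequence databases, whereas the exact string matching, function names and sequences below are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence containing only A, C, G, T."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def count_exact_hits(primer, sequence):
    """Count exact (possibly overlapping) occurrences of primer on either strand."""
    primer = primer.upper()
    sequence = sequence.upper()
    # A set avoids double counting when the primer is its own reverse complement
    queries = {primer, reverse_complement(primer)}
    hits = 0
    for query in queries:
        start = sequence.find(query)
        while start != -1:
            hits += 1
            start = sequence.find(query, start + 1)
    return hits


def is_nonspecific(primer, regions):
    """True if the primer matches exactly in more than one named region."""
    matched = [name for name, seq in regions.items()
               if count_exact_hits(primer, seq) > 0]
    return len(matched) > 1


# Hypothetical sequences: a gene and a pseudogene sharing a short stretch
regions = {
    "gene": "TTTACGTGGATTT",
    "pseudogene": "CCCACGTGGACCC",
    "unrelated": "GGGGGGGG",
}
shared = is_nonspecific("ACGTGGA", regions)
unique = is_nonspecific("TTTACG", regions)
```

A real specificity check would need to tolerate near-identical matches and search the whole genome, which is why a homology search tool is the appropriate choice in practice.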

Table 4 Identifiable reasons for deviation from HWE in 36 SNPs

For the 21 assays where deviation of genotype distribution from HWE (P<0.05) was due to genotyping error (Table 4), the data sets were stratified according to each genotyping technology used (Table 5), and according to the level of deviation from HWE (0.01<P<0.05, or P<0.01). Sources of error appeared to be dependent, at least in part, on the methodology used to type the SNP. One type of error seen in SNPs analysed by directly sequencing PCR products was the inability to distinguish genotypes accurately if the background signal was too high on the sequence trace. An error frequently associated with data generated using TaqMan methodology was the inaccurate calling of individual genotypes that fell between the three main genotype clusters.

Table 5 Stratification of 21 HWD (P<0.05) assays due to genotyping error by methodology

In general, the proportion of assays associated with genotyping error was slightly lower (2.9%) in the P<0.01 assays than in the 0.01<P<0.05 assays (3.8%), although following this stratification (Table 5), the numbers of assays studied are low. A greater proportion of RFLP assays appears to harbour genotyping error, but the number of assays studied (10) was low.

Discussion

Generation and analysis of large SNP genotyping data sets for the investigation of human complex disorders is currently the subject of much discussion, and a focus of activity.34 Large genotyping data sets will inevitably contain some error, which has long been recognised as a problem in the accurate statistical analysis of genetic data.1,9 As genotyping errors are known specifically to affect certain genetic measurements such as LD, upon which association studies depend,20 and also to affect family-based studies of linkage and association,17 the identification of error is critical to accurate analysis and subsequent interpretation of the data. Current interest also surrounds the development of statistical methods that are able to take error into account in genetic data analysis.17,19,23,24,25,26 This large, empirical study reports the measurement of genotype distribution deviation from HWE as a method to identify and reduce genotyping errors generated as a result of the genotyping process itself in population-based studies.

The study measured deviation from HWE in 313 SNPs and revealed 36 HWD (P<0.05) assays, which is 2.3 times the number expected by chance. When considering these data, it is important to remember that the sensitivity of measurement of deviation from HWE will also depend on the minor allele frequencies of the SNPs typed (0.06–0.49), and the number of samples analysed (62–1018). Further investigation of the 36 HWD (P<0.05) assays revealed that 58% of them harboured genotyping error.
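The comparison with chance can be reproduced with simple arithmetic. The sketch below assumes that, if every SNP were in HWE and the test were well calibrated, a fraction equal to the significance threshold would be flagged; the function and variable names are illustrative.

```python
def expected_by_chance(n_tests, alpha):
    """Expected number of tests with P < alpha when every null hypothesis is true."""
    return n_tests * alpha


n_snps = 313        # SNPs tested for deviation from HWE
alpha = 0.05        # significance threshold
observed_hwd = 36   # SNPs observed to deviate at P < 0.05

expected_hwd = expected_by_chance(n_snps, alpha)  # 15.65
fold_excess = observed_hwd / expected_hwd         # about 2.3
```

The excess of roughly 20 assays over the chance expectation is what motivates looking for technical explanations among the flagged SNPs.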

In this study, when the primers defining assay specificity were designed, high-level repeats in the genome sequence, including Alus and LINE, were masked. However, low-level repeat sequences are more difficult to monitor. In order to address this, primers and/or probes defining the reaction specificity for each of the 36 assays in HWD (P<0.05) were retrospectively analysed, by searching against NRNUC and NRHTG databases. In order to identify possible sequence homologies, no sequence filters were used at this stage. Assays developed for five SNPs appeared to be nonspecific. Two of these were SNPs that mapped to a 390 kb region on chromosome 22 flanking CYP2D6.32 Two pseudogenes, CYP2D7 and CYP2D8, lie adjacent to CYP2D6 and the primers defining assay specificity for the two nonspecific SNPs were found to map to either or both of the pseudogenes. Pseudogenes are clearly abundant,35 and therefore experimental design must take these sequences into consideration. The other three nonspecific SNP assays demonstrated 100% homology to more than one chromosomal region.

After analysis of the 36 HWD (P<0.05) SNPs, and the identification of those assays associated with genotyping error or nonspecificity, 10 SNPs remained. It is possible that the deviation from HWE observed in these SNPs is occurring by chance. However, as a large proportion of low-level duplications have not been represented following the assembly of the draft human genome sequence,36 it is conceivable that within this group of HWD (P<0.05) SNPs there are some assays that are nonspecific, but for which not all of the sequences homologous to their probes and primers are captured in the human genome sequence assembly.

The data reported here were generated during 1998–2000 using various different genotyping technologies. As a result of this retrospective analysis, a high-throughput standardised semi-automated genotyping process has been developed. This incorporates automatic ‘electronic PCR’ of primers before running the assay. All SNP assays are tested for deviation from HWE in 94 Caucasian DNA samples prior to running the SNP assay across the DNA sample set of interest. Genotypes are scored by highly trained scientists, and data accuracy is not sacrificed to raise individual assay genotype success rates. Following this process, analysis of genotypes generated from 94 unrelated Caucasians for 1434 SNPs revealed only 10 HWD (P<0.01) SNPs (unpublished data). This is slightly fewer than the number expected to occur by chance (14), suggesting improved data quality.

In conclusion, this study demonstrates the successful identification of a proportion of nonspecific assays and assays harbouring genotyping errors, by using a simple test for HWE. The genotyping process was subsequently modified in order to generate data of improved quality.