Article Text

## Abstract

**Introduction** Genome-wide homozygosity mapping is a powerful method for locating rare recessive Mendelian mutations. However, statistical power decreases dramatically in the presence of genetic heterogeneity.

**Methods** The authors applied an empirical approach to test for linkage accounting for genetic heterogeneity by calculating the sum of positive per-family multipoint LOD scores (S) across all positions, and obtaining corresponding empirical p values (EmpP) through permutations.

**Results** The statistical power of the approach was found to be consistently higher than the classical heterogeneity LOD by simulations. Among 21 first-cousin matings with a single affected child, for five families linked to a locus of interest and 16 families to other loci, S/EmpP achieved a power of 40% versus 28% for heterogeneity LOD at an α level of 0.001. The mean size of peak linkage regions was markedly higher for true loci than false positive regions. The S/EmpP approach was applied to a sample of 17 consanguineous families with Mendelian susceptibility to mycobacterial disease, leading to the identification of two mutations in *IL12RB1* and *TYK2* from the largest of six linkage regions at p<10^{−3}.

**Conclusions** The S/EmpP approach is a flexible and powerful approach that can be applied to linkage analysis of families with suspected Mendelian disorders.

- Genetic epidemiology
- infectious disease
- molecular genetics
- immunology

## Statistics from Altmetric.com

## Introduction

Genome-wide (GW) homozygosity mapping (HM) is a powerful tool for identifying susceptibility loci underlying rare recessive diseases in consanguineous families.1 HM can be viewed as a special case of model-based linkage analysis,2 which generally relies on the assumption that a single locus underlies the trait without consideration for genetic heterogeneity. However, if only a small proportion of families are truly linked to a given locus, the associated positive LOD scores can be overwhelmed by negative scores from unlinked families, masking the linkage evidence. The heterogeneity LOD (HLOD) statistic has classically been used to test for evidence of linkage under the hypothesis of heterogeneity.2 3 All families, linked and unlinked, contribute to the HLOD statistic, and the smaller the proportion of linked families, the lower the power to detect linkage.2

To address the problem of heterogeneity, we propose using the statistic S, which sums positive per-family LOD scores at a given position. For each S, a GW empirical level of significance, EmpP, is subsequently obtained through permutations. The idea of focusing only on LOD scores greater than zero and evaluating significance levels by Monte Carlo methods has already been proposed for single-point linkage4 and multipoint linkage in the context of complex traits.5 Here, we demonstrate the usefulness of the approach to GWHM. In GWHM studies, nuclear families with only one affected child are common and the recessive mode of inheritance is assumed, while in linkage studies of complex disease, multiplex families are typical and the mode of inheritance is often unknown. Through simulations, we explored the power to detect linkage using S/EmpP under different family structure scenarios. We also present an application of the proposed method to a cohort of consanguineous families with Mendelian susceptibility to mycobacterial disease (MSMD), which led to the identification of two mutations through the sequencing of candidate genes found within linkage regions.

## Methods

### The S statistic and EmpP

The statistic, S, relies on the results of a GW linkage scan. Multipoint LOD scores for each family and each position are stored in a P (corresponding to the total number of positions, indexed by i) by F (corresponding to the total number of families, indexed by j) matrix. At each position i, the statistic S_{i} is defined as:_{i} is evaluated empirically. Permutations are used to construct the distribution of S_{i} under the null hypothesis of no linkage (H_{0}) in the following manner: a total of Q (indexed by k) replicated ‘positions’ are generated from the original P×F LOD score matrix. For each k and each family j, a LOD score, *LOD _{jk}*, is drawn at random, with replacement from among the P-observed LOD scores for family j. The statistic S

_{k}is then computed at each k similarly to S

_{i}:

_{i}by computing the proportion of S

_{k}values exceeding S

_{i}.

### Simulations

For a given family, genotypes at biallelic markers were generated at random among founders and marry-ins from a minor allele frequency distribution of 0.4. A total of 1000 equidistant markers were attributed to each of 22 autosomes of size 140 cM (∼140 Mb) leading to a total of 22 000 single-nucleotide polymorphisms (SNPs) over the whole genome. An average of 1.5 crossovers per chromosomal transmission were defined by a Poisson process to match the mean number of crossovers from empirical data.6 For each replicate consisting of 21 families, we considered an increasing number of families linked to a disease locus, D, from 1 to 10. The families not linked to D were either (1.) not linked to any locus or (2.) linked to another locus from D on another chromosome. Three family structures were considered: i. first-cousin mating with a single affected child; ii. first-cousin mating with an affected child and two unaffected siblings; iii. second-cousin mating with a single affected child. D was placed at random and necessarily transmitted to the affected child. Genotypes for the parents and the affected child only were retained for linkage runs.

Power at D was calculated using the S/EmpP approach, setting Q at 100 000 and considering a type I error (α level) of 0.001, across the six combinations of (1.) and (2.) and i., ii. and iii. For each combination, 1000 simulated datasets were generated. We simultaneously calculated the HLOD. The distribution of HLOD is unstable and bounded by the one-sided χ^{2}_{1df} and χ^{2}_{2df} distributions,7 and significance levels for the HLOD are very close to χ^{2}_{1df} when the number of affected offspring per family is small.7 We therefore used this approximation to give an upper estimate of power for HLOD.

Finally, we also examined the size of linkage regions for the intermediate scenario (1.)i. with five linked families. The EmpP at D was considered for true positives (TPs), and the minimum EmpP from among all other 21 autosomes not containing the TP was considered for false positives (FPs). Size was defined by setting borders at the positions corresponding to a one-log increase in EmpP value down- and up-stream from the TP or FP EmpP position.

### Application to MSMD

MSMD (MIM 209950) is a rare syndrome characterised by severe infections by poorly virulent mycobacteria, such as BCG vaccines and environmental mycobacteria, in otherwise healthy subjects.8 As a primary immunodeficiency, MSMD has already been shown to display genetic heterogeneity: a total of six genes have been implicated in the disease, but about half of the patients do not have an identifiable defect.9–12 We conducted GWHM in a cohort of 17 consanguineous families with patients with MSMD (inbreeding loops were based on self-report from the families). Affected children were diagnosed between the ages of 6 months and 8 years. Familial configurations included 14 families where the parents are first cousins, two where the parents are second cousins, and one with two consanguinity loops. Four families had an additional unaffected sibling, three had an additional two unaffected siblings, and two had an additional three unaffected siblings.

Genotyping was conducted on a total of 57 individuals using the Affymetrix 250K SNP array at the Centre National de Génotypage in Evry, France. Autosomal SNPs were filtered on the basis of a minor allele frequency <0.1, genotyping call rate <95% and Hardy–Weinberg equilibrium p value <0.001, and no Mendelian errors were allowed. We applied the Tagger algorithm13 available in the Haploview software (http://www.broad.mit.edu/mpg/haploview/download.php) to select 33 923 autosomal tag SNPs with an R^{2} cut-off of 0.1. GWHM was conducted on the tagSNPs using MERLIN14 under a recessive model, and per-family multipoint LOD scores were obtained at each position. The S statistic and EmpP were calculated at each position using an in-house C script freely available for download at www.hgid.net.

## Results

Power to detect linkage at D according to number of linked families is displayed in figure 1A,B for family structure i. (first-cousin mating with a single affected child only). The S/EmpP approach is consistently more powerful than HLOD for a range of two to seven linked families when 1. no other families are linked to another locus (A) (eg, for six linked families, EmpP power=0.863 and HLOD power=0.643) or 2. All other families are linked to another locus (B) (eg, for six linked families, EmpP power=0.752 and HLOD power=0.626). Considering an intermediate scenario where five families are linked to D (figure 1C), EmpP power decreased slightly from 0.50 to 0.40 compared with HLOD, which was stable (∼0.27), as the number of other linked families increased from 0 to 16. For the two other family structures, ii. and iii., similar patterns were observed, and power for S/EmpP was higher overall, while the difference in power between S/EmpP and HLOD was smaller (data not shown).

In addition to power considerations for detecting linkage, it is also of interest whether the strongest signals correspond to the TP. For five linked families and no other loci, the linked chromosome out of all autosomes displayed the most significant EmpP for 65% of 1000 simulations, and the second most significant for 19.6% of simulations. Of all 1000 runs, 481 achieved an EmpP <0.001, and among the 21 other chromosomes per simulation, 545 runs displayed at least one FP signal with an EmpP <0.001. The mean and median sizes of linkage regions corresponding to these signals were 10.3 and 8.5 Mb for TPs and 5.9 and 4.9 Mb for FPs (figure 1D). The difference in mean sizes between TPs and FPs was found to be highly significant (p=10^{−42}). The proportion of linkage regions >10 Mb was found to be 40.1% for TPs and 10.6% for FPs.

The results of the GW linkage scan on 17 MSMD families using S/EmpP are displayed in figure 2A. Six peaks on chromosomes 3, 16, 6 (two peaks) and 19 (two peaks) were identified with minimum EmpP (mean size in Mb) of 0.00004 (0.338), 0.00043 (0.024), 0.00051 (1.50), 0.00008 (1.85), 0.00032 (3.38), and 0.00042 (21.1), respectively. The two largest regions, both on chromosome 19, were almost contiguous, separated by a small region of 3 Mb, with an EmpP for linkage <0.05 for the entire segment. Four families contributed to S at the first peak, and five families contributed to S at the second peak, including three families in common, which explained the pattern of linkage observed over the chromosome 19 region. The two genes in the region *IL12RB1* (a known MSMD gene9) and *TYK2* (involved in the signalling of several cytokines including interleukin (IL)-1215) were prioritised for sequencing. A homozygous missense mutation in exon 4 of *IL12RB1* (D101Y) was identified in a Turkish patient with disseminated mycobacterial disease, for whom no cells were available for prior functional characterisation, corresponding to the second chromosome 19 peak (figure 2B). The mutation D101Y was not found after sequencing of exon 4 among 1040 Centre d'Etude du Polymorphisme Humain (CEPH)16 controls, and was predicted with a probability of 99.2% to be probably damaging using Polyphen software.17 A 9 bp homozygous deletion in exon 16 of *TYK2* was identified in another Turkish patient with BCGosis, brucella meningitis and cutaneous herpes infections (S S Kilic *et al*, unpublished data) corresponding to the first chromosome 19 peak (figure 2B). This deletion, which is not in-frame, led to the creation of a premature stop codon at amino acid 767 and complete absence of production of the full protein (A Kreins *et al*, unpublished data). Sequencing of all linked patients did not uncover any other mutations in the coding sequence of these two genes. We did not sequence other genes in the interval in any of the patients. Altogether, this approach allowed us to diagnose IL-12Rβ1 deficiency, a known MSMD genetic aetiology, in one kindred, and to discover that Tyk2 deficiency, previously identified as a genetic aetiology of the hyper IgE syndrome,15 may also underlie MSMD.

## Discussion

We used simulations to present the utility of a method for handling genetic heterogeneity in the context of GWHM based on the sum of positive per-family multipoint LOD scores at each position (S), and evaluation of statistical significance (EmpP) through permutations. The S/EmpP approach is consistently more powerful than the HLOD for small numbers of families linked to the locus of interest up to saturation, regardless of the number of other linked families, which represented the heterogeneity context. There are several advantages to using S/EmpP for cohorts of consanguineous families. Family structures are typically heterogeneous, providing varying degrees of informativity as seen in our MSMD cohort, and the distribution of S under H_{0} maintains the same mix of family structures in a given study. The method considers data across the full genome in the assessment of significance, which is appealing for the interpretation of EmpP. These considerations and the ease of application make S/EmpP an effective approach for directing sequencing efforts, including newly available exome sequencing after linkage studies. Recent examples include the identification of a mutation underlying non-syndromic hearing loss18 and classic Kaposi sarcoma19 and the genetic basis of Fas-associated death domain protein (FADD) deficiency.20

In all GW linkage studies including GWHM, the investigator faces the vexing possibility of FP peaks. In GWHM specifically, the presence of a common ancestor within the family structure, and the small number of meioses between the common ancestor and the affected child, ensures that, on average, the true locus will be transmitted to the affected child with a relatively large common chromosomal background. For the first-cousin mating with a single affected child family structure and five out of 21 linked families (scenario 1.i.), the mean size of TPs was nearly twice the mean size of FPs (10.3 vs 5.9 Mb). While the distribution of linkage region sizes observed depends on the family structures in the sample and the number of truly linked families, we recommend that the investigator prioritise linkage regions that achieve a predefined EmpP value by size for further investigations.

This conclusion was corroborated by the MSMD application results: the two identified mutations come from the two largest linkage regions, themselves in close proximity. Interestingly, in these regions, the S/EmpP approach was more powerful than the classical HLOD method by about one order of magnitude. *IL12RB1* is a known MSMD gene,9 21 22 and *TYK2* has previously been identified as underlying the hyper IgE syndrome.15 Characterisation of the functional MSMD disease-causing role of the *TYK2* mutation is ongoing. Functional MSMD disease-causing role of the TYK2 mutation is on-going (A Kreins *et al*, unpublished data). It is worthy of note that our patient was subsequently found to display cutaneous viral disease, like the previously reported Tyk2-deficient patient and consistent with the role of Tyk2 in interferon α/β signalling (A Kreins *et al*, unpublished data).15 The finding of mutations in two distinct genes for two closely related linkage regions highlighted the presence of genetic heterogeneity within the linkage region. Coincidentally, the two genes control the same IL-12 responsive pathway. We have continued to apply the S/EmpP approach combined with exome sequencing to other samples of consanguineous families with Mendelian conditions, including MSMD, with promising results. Finally, the S/EmpP approach can be extended to any sample of families with a putative Mendelian aetiology.

## Acknowledgments

Thanks to the patients and their parents. Thanks also to members of the HGID laboratory, to Joan Bailey-Wilson, Robert Wojciechowski and Anne-Louise Leutenegger for helpful discussions.

## References

## Footnotes

Funding The Laboratory of Human Genetics of Infectious Diseases is supported by grants from the Schlumberger Foundation, the BNP Paribas Foundation, the Institut Universitaire de France, and the European Union (grant QLK2-CT-2002-0046), the Rockefeller University Center for Clinical and Translational Science (grant 5UL1RR024143), the Rockefeller University, the Bill and Melinda Gates Foundation, the St Giles Foundation, the Jeffrey Modell Foundation and Talecris Biotherapeutics, and National Institute of Allergy and Immunology (grant 1R01AI089970). AVG was supported in part by the Fondation pour la Recherche Médicale.

Competing interests None.

Ethics approval This study was conducted with the approval of the Rockefeller University.

Provenance and peer review Not commissioned; externally peer reviewed.

## Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.