We thank the Human Gene Mutation Database (HGMD) team for critically analysing our results, and highlighting some potential problems with our analysis.
Many of the criticisms raised by Stenson et al. relate to mutations that were outside the terms of reference of the study. For instance, a major criticism was that the review was not comprehensive, viz. that we neglected to compare all categories of mutations. We limited our comparison to single-base mutations because other mutation types proved too difficult to text-mine reliably from OMIM. This could be viewed as indirect discrimination against HGMD, but these mutations do represent the majority of characterized mutations in the database and are thus a reasonable variable to quantify.
We concluded that OMIM and HGMD were the most comprehensive resources, but that differences between them, including missing genes, highlighted the importance of using both when searching for information on a gene.
Another criticism was our failure to distinguish between somatic and heritable mutations, but we never claimed to do so. Both are associated with disease. Also, it is not clear why HGMD does not include mitochondrial mutations, as these are inherited.
Claim 1 - 143 genes are present in OMIM but have no corresponding HGMD entry.
Of the 143 genes identified as not having single-base mutations in HGMD although they are present in OMIM:
The remaining 14% deserves further scrutiny. A genuine error did arise with the 12 genes mentioned, which included FUZ. This error arose because many genes in OMIM and HGMD do not follow the official HGNC gene names. In our attempt to ensure that we compared the same gene from OMIM and HGMD, we converted all genes to their HGNC gene name, where available. Unfortunately, in a few cases, genes were assigned the wrong HGNC name. An example is the FUZ gene, highlighted by Stenson et al. as having no mutations. The problem originated in OMIM, where the gene was described not by a standard name but by a non-unique alias, FY, which is used for two different genes. FY was converted to the wrong HGNC name, FUZ, rather than to the correct gene, DARC. This highlights a further problem in the general mutation databases and the importance of datasets adhering to HGNC nomenclature to prevent confusion.
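The FY case can be illustrated with a minimal sketch of alias resolution that refuses to guess when an alias is shared. The `ALIASES` table and the `resolve` function are hypothetical illustrations, not the pipeline used in the study; only the FY, FUZ and DARC relationship is taken from the text above.

```python
from collections import defaultdict

# Toy records: approved symbol -> known aliases. Only the FY example comes
# from the text; the table layout is a hypothetical stand-in for an HGNC export.
ALIASES = {
    "FUZ": ["FY"],
    "DARC": ["FY"],
}

def build_alias_index(aliases):
    """Invert the table so each alias maps to every approved symbol using it."""
    index = defaultdict(set)
    for approved, names in aliases.items():
        for name in names:
            index[name.upper()].add(approved)
    return index

def resolve(symbol, aliases):
    """Return (candidate approved symbols, status) without guessing."""
    symbol = symbol.upper()
    if symbol in aliases:
        return [symbol], "approved"
    candidates = sorted(build_alias_index(aliases).get(symbol, set()))
    if not candidates:
        return [], "unknown"
    status = "alias" if len(candidates) == 1 else "ambiguous"
    return candidates, status

print(resolve("FY", ALIASES))
```

Flagging the "ambiguous" status for manual review, rather than picking the first candidate, is one way to avoid the kind of misassignment described above.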
The remaining eight genes, which included GPT, are listed as normal polymorphic variants. These are legitimately excluded from HGMD which specifically addresses disease but included in OMIM which has wider terms of reference "a catalog of human genes and genetic disorders". In the context of the purpose of our study which focused on disease, it is fair to call their inclusion into question.
Claim 2 - 226 genes in OMIM contain more mutation entries than HGMD.
Of the 83 genes listed by HGMD:
Claim 3 - Many mutations are missing from HGMD that were published in the journal Human Mutation.
Stenson et al. claim that the data presented in Supplementary Table 2 is "highly misleading" and that "HGMD is not missing any of the mutations that the authors claim." We have (yet again) performed searches in the public version of HGMD for the variations listed in Supplementary Table 2 from issues 25(5) - 27(12) of Human Mutation and stand by our original assertion that many mutations are missing from the public version of HGMD.
While the mutations under examination may very well have been entered into HGMD Professional within "1-2 months of their publication in the Human Mutation issue cited in the table" as Stenson et al. claim, the delayed release of HGMD to non-subscribers means that these mutations are not available in the public version, the specific database release our study sought to examine.
Stenson et al. go on to claim that "Several mutations had already been described in the literature prior to their publication as 'novel' in Human Mutation (e.g. PAX6 1410delC)." We note that the source of the PAX6 1410delC mutation (Sale et al. 2004) claims it as a novel mutation. Our inclusion of it, rather than counting against our study, could be seen as an indicator of the inaccuracies of variation databases, since non-novel mutations are being not only mislabelled as novel but published as such.
Finally, Stenson et al. note that we included "many neutral and somatic variants." As we outlined above, we never claimed to distinguish between somatic and heritable mutations.
As we noted in our paper, it is not always clear if a particular mutation is included in HGMD due to gene name changes between publications and HGMD, and HGMD’s use of non-standard nomenclature. In these cases, we have endeavoured to give HGMD the benefit of the doubt.
Claim 4 - R158Q in PAH is in error.
We thank Stenson et al. for pointing out the error in the electronic version of our manuscript regarding R158Q in PAH; we have now updated the main manuscript. The Dworniczak et al. (1989) article reports that a G>A base change gives rise to R158E, whereas it in fact gives rise to R158Q. Both OMIM and HGMD had rectified this error in their databases, but there is no annotation to indicate the correction. The original source, which lacks a published correction, is cited as the definitive reference. This error has been corrected in our manuscript, but it does highlight the need for detailed annotation of mutations, e.g. where reports conflict.
Claim 5 - HGMD is missing two specific genes (COL9A1 and PTCH2), and Claim 6 - Patchy coverage of gene and mutation data in HGMD.
Again, the use of single-base mutations only is a fair assessment as they do make up the majority of HGMD. The purpose of our study was to gauge what data is currently available and HGMD actually did well in comparison to the other databases.
Claim 7 - The authors claim no competing interests.
Competing interests usually refer to commercial interests. The interest of "several of the authors" is not to set up "from scratch a new and all embracing human variation/mutation database". The aim is a complete collection of variations and their phenotypes, so that they can be curated expertly in locus-specific databases (LSDBs) that are public and that can be harvested by, or sent to, comprehensive databases such as HGMD and those at NCBI, EBI and UCSC. We are pleased to say that a consortium of members of HGVS (www.hgvs.org) has been responsible for hundreds of LSDBs being created, made publicly available and made use of by HGMD, and this effort will continue under the banner of the Human Variome Project (www.humanvariomeproject.org).
In conclusion, we congratulate the curators of HGMD and OMIM for providing two such crucial resources for inherited disease diagnostics and research. Our study raised a number of explicable problems outlined by Stenson et al., many of which highlight the problems of the mutation databasing field, viz. non-use of standard nomenclature, non-coverage of all types of variation, lack of annotation of corrections, and the need for public and private versions of HGMD due to funding strictures.
1. Sale MM, Craig JE et al. Broad phenotypic variability in a single pedigree with a novel 1410delC mutation in the PST domain of the PAX6 gene. Hum Mutat 2004;20:322.
2. Dworniczak B, Aulehla-Scholz C and Horst J. Phenylketonuria: detection of a frequent haplotype 4 allele mutation. Hum Genet 1989;84:95-96.
We write in response to a number of very specific criticisms of the Human Gene Mutation Database (HGMD) made in the recently published article of George et al. [PMID: 17893115]. All seven claims made were amenable to empirical testing. Having tested these claims, we find all of them to be either false or highly misleading. In the text that follows, we refute or rebut each claim in turn.
HGMD represents an attempt to collate known (published) gene lesions responsible for human inherited disease. HGMD comprises various types of germ-line mutation within the coding, splicing and regulatory regions of human nuclear genes. HGMD currently (1st October 2007) contains 73,411 different mutations in 2,768 human genes.
Claim 1 – 143 genes are present in OMIM but have no corresponding HGMD entry.
Response – This claim is wholly misleading. OMIM records many types of gene mutation which HGMD does not, such as somatic lesions, neutral polymorphisms and mitochondrial mutations. It is therefore to be expected that there will be some entries in OMIM that do not have a corresponding HGMD entry. We received the list of 143 genes from George et al. and performed our own analysis. After careful comparison with OMIM, we found the following:
• 33 genes (23.1%) contain only somatic allelic variants (e.g. LEF1).
• 29 genes (20.3%) contain variants exclusively from the mitochondrial genome (e.g. MTCO2).
• 8 genes (5.6%) contain exclusively normal polymorphic protein variants with no known disease association (e.g. GPT).
• 13 genes (9.1%) were actually present in HGMD at the time of the study, but, for whatever reason, had not been found by George et al. (e.g. CD2AP).
• 12 genes (8.4%) were misidentified by George et al. as having allelic variants in OMIM when they did not have any (or no longer do) (e.g. FUZ).
• 5 genes (3.5%) contained ‘disease-associated variants’ whose accompanying information was deemed to be in some way of insufficient quality to allow these to be entered into HGMD (e.g. GABRA2).
• 43 genes (30%) were entered into HGMD after the George et al. study was performed (e.g. SAMD9).
George et al. could legitimately have claimed that HGMD was missing 43 (not 143) genes on the basis of their study data. However, it should be noted that all but 3 of these 43 genes were entered into HGMD at the latest by the end of February 2007 (within 4 months of the stated date of the George et al. study). The remaining 3 genes were entered more recently. Thus, far from being an indictment of HGMD content, the analysis of George et al. would appear to confirm what we believe to be HGMD’s very high degree of efficiency in incorporating pathological mutations responsible for causing human genetic disease.
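The breakdown above can be checked with a short tally. This is only an arithmetic sketch: the counts are copied from the bullet points, while the category labels and the `percentages` helper are illustrative names introduced here.

```python
# Counts as listed in the bullet points (short label -> number of genes).
breakdown = {
    "somatic only": 33,
    "mitochondrial only": 29,
    "neutral polymorphisms only": 8,
    "already in HGMD": 13,
    "no allelic variants in OMIM": 12,
    "insufficient quality": 5,
    "entered after the study": 43,
}

def percentages(counts):
    """Return the total and each category's share, rounded to one decimal."""
    total = sum(counts.values())
    shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
    return total, shares

total, shares = percentages(breakdown)
print(total)
for label, share in shares.items():
    print(f"{label}: {breakdown[label]} genes ({share}%)")
```

The seven categories account for all 143 genes on the list, and only the last category corresponds to genes that were absent from HGMD when the study was performed.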
Claim 2 – 226 genes in OMIM contain more mutation entries than HGMD.
Response – This claim is false. George et al. provided HGMD with a list of the 226 genes, which we have carefully reviewed. As 143 of these genes were present in this list simply due to their initial inclusion in the previous list (reviewed in Claim 1), we discounted these for the purposes of this analysis. We were therefore left with 83 genes to check. With respect to these 83 genes, we found the following:
• 23 (27.7%) entries contained additional somatic allelic variants (e.g. AXIN2).
• 10 (12%) entries contained additional polymorphic variants or haplotypes (e.g. CCR5).
• 5 (6%) entries (all globin genes) contained additional non-disease associated variants (e.g. HBA1).
• 17 (20.5%) entries actually had an equal number or fewer mutations than HGMD (e.g. HAL).
• 18 (21.7%) entries were added to HGMD after the George et al. study was completed (e.g. OTOF).
• 10 (12.1%) entries contained data which were indeed missing from HGMD (e.g. CFD).
Consequently, George et al. could, on the basis of their study data, legitimately have claimed that, at the time of writing, there were 28 (not 226) genes present in OMIM with more allelic variants than HGMD. The mutations from 18 of these genes were however added very shortly after the study of George et al. was concluded. The remaining 10 genes contained around 18 mutations which were inadvertently omitted from HGMD. Thanks, however, to George et al. and the prior efforts of the OMIM curators, the missing mutation data for these 10 genes have now been included in HGMD.
In their analysis, George et al. ignored several categories of mutation present in HGMD (small and gross deletions, insertions and indels, complex rearrangements and repeat variations). These categories contain significant (23,570 in HGMD Professional release 7.3) numbers of mutations. To ignore them in the published analysis was highly misleading and would have inevitably led to erroneous conclusions being drawn (due to an apparent failure to compare like with like). A good example of the type of error made is provided by the C6 gene. This gene currently has 4 allelic variants listed in OMIM (plus one neutral polymorphism). The HGMD entry for C6 has 9 mutations listed (6 at the time of the published study), yet according to George et al., HGMD had fewer mutation entries than OMIM for this gene.
Claim 3 – Many mutations are missing from HGMD that were published in the journal Human Mutation.
Response – This claim is highly misleading. These ‘missing’ mutations are listed in Supplementary Table 2 of the George et al. paper. We have carefully reviewed the data in this table and have concluded that HGMD is not missing any of the mutations that the authors claim. All but 4 of the disease-causing inherited lesions listed with either “no”, “not in website” or “unable to determine” had been entered into HGMD within 1-2 months of their publication in the Human Mutation issue cited in the table. Four others were entered later, but would have certainly been present at the time of the study. George et al. had access to HGMD Professional and could easily have obtained data entry dates from this version of HGMD had they been able to locate the listed mutations. Several mutations had already been described in the literature prior to their publication as ‘novel’ in Human Mutation (e.g. PAX6 1410delC). In accordance with HGMD policy, the earlier paper was given priority as the reference to be cited rather than the subsequent report in Human Mutation. George et al. also listed many neutral and somatic variants in their table as if they had expected to find these data in HGMD (e.g. ATM c.185+78A>G and ANP32C g.4870T>C). It is quite apparent to us that George et al. have displayed a complete lack of understanding of the nature of the data that HGMD seeks to collate, a serious inability to interpret published mutation data and/or an inability to undertake basic data searching and retrieval from HGMD.
Claim 4 – R158Q in PAH is in error.
Response – This claim is incorrect. The reference cited by both HGMD and OMIM [Dworniczak et al., Hum. Genet. (1989) 84: 95-6] contained an error, in that the G>A base change reported would have given rise to R158Q and not R158E as described. As part of our curation process, we corrected this error (and the curators of OMIM have done likewise). It is noteworthy that the reference given by George et al. for this mutation [Hennermann et al., Hum. Mutat. (2000) 15: 254-260] does not actually claim that this lesion was novel to their study.
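The codon reasoning behind this correction can be sketched by enumerating every single G>A change in an arginine codon. The sketch assumes the standard genetic code and a change read on the coding strand; the specific PAH codon is not given in the text, so all six arginine codons are checked. The function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def single_g_to_a_outcomes(residue):
    """Map each codon for `residue` to the residues reachable by one G>A change."""
    outcomes = {}
    for codon, aa in CODON_TABLE.items():
        if aa != residue:
            continue
        reachable = set()
        for i, base in enumerate(codon):
            if base == "G":
                mutant = codon[:i] + "A" + codon[i + 1:]
                reachable.add(CODON_TABLE[mutant])
        outcomes[codon] = reachable
    return outcomes

arg = single_g_to_a_outcomes("R")
for codon, residues in sorted(arg.items()):
    print(codon, sorted(residues))
```

Under these assumptions, glutamine (Q) is reachable from an arginine codon by a single G>A change, whereas glutamate (E) is not reachable from any of them.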
Claim 5 - HGMD is missing two specific genes (COL9A1 and PTCH2).
Response – This claim is incorrect. COL9A1 has been present in HGMD since 2001 with one small insertion mutation logged at that time. Since George et al. elected to utilise only single base-pair substitutions in their analyses (thereby ignoring approximately one third of the HGMD dataset), this entry was missed [a nonsense mutation (R272X) was added to this entry shortly after the George et al. study was concluded]. The question should be again raised as to why the authors chose to ignore HGMD micro-insertions, micro-deletions, indels, gross lesions and repeat variations in their analyses, thereby excluding some 23,570 different human gene lesions and 254 genes logged in HGMD with only these categories of mutation. The second gene that was claimed to be “missing” from HGMD was PTCH2. However, the two allelic variants listed in OMIM for this gene are both somatic and so HGMD did not “miss” this gene at all, since HGMD only includes heritable lesions.
Claim 6 – Patchy coverage of gene and mutation data in HGMD.
Response – This assertion was made on the basis of a study that appears to be deeply flawed, methodologically and statistically. The authors seem to have little appreciation or understanding of the types of mutation data recorded by either OMIM or HGMD. Once again, and for whatever reason, the authors excluded 23,570 mutations (almost one third of HGMD data) from their analysis. They then have the temerity to criticise HGMD for patchy coverage!
Claim 7 – The authors claim no competing interests.
Response – In our opinion, this claim is hard to justify. Several of the authors of the George et al. paper are currently seeking substantial funding to set up from scratch a new and all embracing human variation/mutation database. Since HGMD is in practice the only comprehensive central repository for human gene mutations in existence, their comparative ‘analysis’ of HGMD data should at the very least, in our view, have been accompanied by a clear statement of the potential conflict of interest inherent in their critical conclusions. It is quite disingenuous for the authors to claim otherwise.
In summary, in a deeply flawed study, George et al. have drawn numerous incorrect or misleading conclusions with respect to HGMD, its remit, content and coverage. Their study represents a graphic example of how over-reliance on automated text-mining, a reluctance to attempt any independent verification of their initial findings and an apparent lack of knowledge of the mutation databases they were analysing, can combine together to yield wholly erroneous conclusions. We are not in any way resistant to the idea of data quality assessment, but any such assessment should at the very least adhere to certain basic analytical standards and ought to be carried out in a proper scientific manner. We were not contacted prior to this article being accepted for publication. Had we been asked to comment, we could have easily cleared up the many inaccuracies and misinterpretations that litter the George et al. paper. Having said this, however, it is unclear whether the authors would then have been able to draw any meaningful conclusions other than that HGMD has succeeded in providing fairly comprehensive coverage of its target data viz. mutations in human nuclear genes causing inherited human disease. Thus, it would appear that far from providing evidence for the shortcomings of a central mutation database, George et al. have inadvertently succeeded in demonstrating that HGMD fulfils this role exceptionally well.