Pseudogenes were initially regarded as non-functional genomic fossils resulting from inactivating gene mutations during evolution. However, later studies revealed that they play a plethora of roles at multiple levels (DNA, RNA and/or protein) in diverse physiological and pathological processes, especially in cancer, in both parental-gene-dependent and parental-gene-independent manners. Pseudogenes can interact with parental genes or other gene loci, leading to alterations in their sequences and/or transcriptional activities. Pseudogene-derived RNAs play multifaceted roles in post-transcriptional regulation as antisense RNAs, endogenous small interfering RNAs, competing endogenous RNAs and so on. Pseudogenic proteins can mirror, mimic or interfere with the functions of their parental counterparts. Herein, we discuss the general aspects (origination, classification, identification) of pseudogenes, focus on their multiple functions in cancer pathogenesis and consider their potential as molecular signatures assisting in cancer reclassification and tailored therapy.
- Cell biology
- Clinical genetics
The word ‘pseudogene’ was first introduced by Jacq et al1 in 1977, when a copy of the 5S rRNA gene was discovered in Xenopus laevis with a 5′-end truncation and 14-bp mismatches that render it non-functional. Since then, numerous pseudogenes have been discovered in organisms from prokaryotes to eukaryotes. The human genome contains about 11 000 pseudogenes,2 more than half the number of protein-coding genes.
Traditionally, pseudogenes have been considered genomic loci that resemble real genes yet are biologically inconsequential because they harbour mutations that abrogate their transcription or translation.3 As a result, they were once regarded as ‘junk genes’, ‘relics of evolution’ or ‘genomic fossils’.4,5 Recently, however, with the aid of next-generation sequencing and advances in non-coding RNA research, multilayered functions of pseudogenic DNA, RNA or protein have been discovered in multiple cancers. Pseudogenes play important roles in transcriptional and post-transcriptional regulation and also have the potential to evolve into novel genes, thus serving as a reservoir for gene renewal. Moreover, a small number of pseudogenes have been reported to retain or regain protein-coding properties, and the resultant pseudogenic proteins/polypeptides mirror or interfere with the functions of their parental counterparts in tumorigenesis.6–11 In this review, we discuss the identification, classification, functions and clinical relevance of pseudogenes in cancer, with recent advances and future perspectives.
Origination and classification of pseudogenes
The existence of more than one copy of a gene in the human genome allows the production of gene variants, which may generate novel genes in some contexts but give rise to pseudogenes in others. Pseudogenes can derive from gene mutations, unfaithful gene duplications or retrotransposition of processed mRNAs back into the genome. Accordingly, pseudogenes can be categorised into three types: (1) unitary pseudogenes (figure 1A), (2) duplicated or unprocessed pseudogenes (figure 1B) and (3) processed or retrotransposed pseudogenes (figure 1C).
Unitary pseudogenes are generated when spontaneous mutations in a coding gene abolish either transcription or translation of that gene. As a result, unlike the other two types of pseudogenes, unitary pseudogenes lack fully functional counterparts (termed ‘ancestral genes’, ‘cognate genes’ or ‘parental genes’). Duplicated pseudogenes derive from unfaithful gene duplication, in which the copy loses promoters/enhancers or acquires frameshift mutations or premature stop codons that render it non-functional, whereas the parental gene remains functional. Duplicated pseudogenes are often located in the vicinity of their parental genes. Both unitary and duplicated pseudogenes retain intron–exon structures. In contrast, processed pseudogenes lack introns because they originate from mRNAs that are reverse-transcribed into DNA and then integrated back into the genome at a new location.
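The three-way classification above can be summarised as a small decision rule. The following Python sketch is illustrative only: the `Locus` fields and the function name are hypothetical, and the rule assumes that the two features named in the text (presence of a functional parental gene and retention of intron–exon structure) are already known for the locus. It does not handle edge cases such as parental genes that are themselves intronless.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    # Hypothetical summary of a candidate pseudogene locus.
    has_functional_parent: bool  # a functional parental gene exists in the genome
    has_introns: bool            # intron-exon structure is retained


def classify_pseudogene(locus: Locus) -> str:
    """Assign one of the three pseudogene types described in the text."""
    if not locus.has_functional_parent:
        # The gene itself was inactivated, so no functional counterpart remains.
        return "unitary"
    if locus.has_introns:
        # Unfaithful duplication keeps the intron-exon structure.
        return "duplicated"
    # Derived from a reverse-transcribed mRNA, hence intronless.
    return "processed"
```

For instance, `classify_pseudogene(Locus(has_functional_parent=True, has_introns=False))` returns `"processed"`.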
Identification and cancer-specific expressions of pseudogenes
Because of their high homology to parental genes, a major challenge in pseudogene studies is distinguishing pseudogenes from their parental genes, with individual genome differences and sequencing errors further complicating the matter. In recent years, multiple approaches have been developed for this purpose at the DNA level12–14 or, for expressed pseudogenes, at the RNA level.14–16
Pipelines established to identify pseudogene DNA include PseudoPipe,17 the Human and Vertebrate Analysis and Annotation (HAVANA) method,18 PseudoFinder and RetroFinder.19 These pipelines have now been integrated into a consensus platform called ENCyclopedia Of DNA Elements (ENCODE), the most comprehensive database for pseudogenes at present.20 ,21
Previously, approaches to identify pseudogene RNA were quite limited, mainly relying upon incongruent gene expression platforms, such as public mRNA and Expressed Sequence Tag databases, Cap Analysis Gene Expression libraries or gene identification signature-paired end tags.22 In 2012, Shanker Kalyana-Sundaram and his colleagues15 developed a bioinformatics pipeline to detect pseudogene transcription based on next-generation sequencing data from 293 samples (13 cancer and normal tissue types) and identified 2082 pseudogene transcripts, of which 154 were highly tissue-specific and 218 were expressed only in cancer samples (178 in multiple cancers and 40 in a single cancer type). Of these, a breast cancer-specific pseudogene, ATP8A2-ψ, was selected for validation by TaqMan assays, and the results showed strong concordance with the bioinformatics analysis, with ATP8A2-ψ expression found to be restricted to breast tumours with luminal histology. Moreover, subsequent overexpression and knockdown experiments in vitro indicated an oncogenic role for ATP8A2-ψ.15 Similarly, two recent studies by Shen et al23 and Pan et al24 reported that the pseudogene polymorphisms POU5F1P1 rs10505477 and E2F3P1 rs9909601 are correlated with the prognosis of patients with gastric cancer and liver cancer, respectively. Another example of a clinically significant pseudogene is POU5F1B, a processed pseudogene located adjacent to MYC on human chromosome 8q24, which is a reliable prognostic marker for patients with stage IV gastric cancer and plays an oncogenic role both in vitro and in vivo.25
Recently, Han et al16 developed a similar computational pipeline and detected 9925 pseudogene transcripts in 2808 samples across seven cancer types from The Cancer Genome Atlas (TCGA) RNA-seq data. Many of the detected pseudogene transcripts are tissue-specific and/or cancer-specific. Moreover, this study was the first to systematically reveal the potential of pseudogenes as prognostic and subtype biomarkers in cancers. Tumour subtypes based on pseudogene expression profiles showed high concordance with molecular subtypes based on other omic data such as mRNA expression, miRNA expression, DNA methylation and somatic copy number variation (SNV). Moreover, in kidney cancer, subtypes based on pseudogenes showed stronger prognostic power than those based upon mRNA, miRNA or other molecular data.16
The detection efficacy of these RNA-based bioinformatics pipelines is determined by three factors: (1) pseudogene expression level (highly expressed pseudogenes are more likely to be detected), (2) coverage depth of RNA sequencing (the deeper the coverage, the more sensitive the detection) and (3) the distribution pattern of mismatches between pseudogene and parental gene (eg, mismatches clustered in a short stretch of sequence are more likely to be detected than mismatches scattered over long stretches). Future progress in sequencing accuracy and coverage depth will facilitate research in this regard.
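The third factor, the distribution of mismatches, can be made concrete with a toy calculation. The sketch below is not taken from any of the cited pipelines; the function names are hypothetical, and it assumes an ungapped, equal-length alignment of pseudogene and parental sequences. It counts the largest number of distinguishing positions that a single read of a given length could cover, which is higher when mismatches are clustered.

```python
def mismatch_positions(parent, pseudo):
    """Positions at which two aligned, equal-length sequences differ."""
    if len(parent) != len(pseudo):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(parent, pseudo)) if a != b]


def max_mismatches_per_read(parent, pseudo, read_len):
    """Largest number of mismatches that fit inside one read of read_len bases."""
    positions = mismatch_positions(parent, pseudo)
    best = 0
    left = 0
    for right in range(len(positions)):
        # Shrink the window until both end positions fit within a single read.
        while positions[right] - positions[left] >= read_len:
            left += 1
        best = max(best, right - left + 1)
    return best
```

With a 5-base read, three adjacent mismatches give a value of 3, whereas three mismatches spread over 20 bases give a value of 1, so a read overlapping the clustered region carries more evidence for assigning it to the pseudogene rather than the parental gene.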
Functions of pseudogenes in cancer
Pseudogenes were once regarded as functionally inert and subject to random genetic drift. However, studies in recent decades have accumulated evidence for the evolutionary conservation of pseudogenes across different mammals.26–28 The ratio of non-synonymous to synonymous substitution rates (Ka/Ks) is usually applied to determine whether a sequence is under evolutionary constraint. Generally speaking, Ka/Ks is less than one if the sequence is under purifying selection, equal to one if it is evolving neutrally and greater than one if it is under positive selection. Theoretically, non-functional sequences should evolve neutrally, and their Ka/Ks ratios are expected to be equal to one. However, it has been reported that Ka/Ks values of genes and pseudogenes overlap greatly, suggesting that some pseudogenes are under evolutionary constraint rather than evolving neutrally, lending support to their roles as functional units.29–31
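The interpretation of Ka/Ks described above can be written as a short rule. In the Python sketch below, the function name and the tolerance `tol` around one are assumptions made for illustration; real analyses estimate Ka and Ks from codon alignments and test the deviation from one statistically rather than with a fixed cut-off.

```python
def selection_regime(ka, ks, tol=0.1):
    """Interpret a Ka/Ks ratio; tol is an arbitrary illustrative tolerance."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if ratio < 1 - tol:
        return "purifying"  # non-synonymous changes are selected against
    if ratio > 1 + tol:
        return "positive"   # non-synonymous changes are favoured
    return "neutral"        # ratio close to one, as expected for inert sequence
```

Under this rule, a pseudogene with Ka = 0.02 and Ks = 0.2 would be labelled `"purifying"`, which is the pattern the cited studies take as evidence of functional constraint.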
In recent years, multilayered functions of pseudogene DNAs, RNAs or proteins have been reported in diverse cancer types.
Functions of pseudogene DNAs
Pseudogene DNAs can function via gene conversion, homologous recombination, exonisation or insertional mutation. The former two events often occur between a pseudogene and its parental gene, whereas the latter two usually occur between a pseudogene and a host gene (figure 2A–E).
Gene conversion is a process in which one DNA sequence replaces a homologous sequence such that the sequences become identical after the conversion (figure 2A). Theoretically, gene conversion from pseudogene to parental gene provides an ideal chance for oncogene activation and/or tumour suppressor gene inactivation.
Conversions from pseudogene to parental gene have been reported in several human diseases.32,33 Cytochrome P450 2A6 (CYP2A6), for example, an enzyme that metabolises precarcinogens including nicotine, has a pseudogene called CYP2A7. Gene conversion from CYP2A7 to CYP2A6 generates CYP2A6*1B, which has higher nicotine metabolism activity in vivo and thus influences cigarette consumption as well as smoking-induced lung cancer risk.34
A pseudogene and its parental gene can also exchange DNA sequence with each other, a process called homologous recombination (figure 2B). The breast and ovarian cancer-susceptibility gene BRCA1, for example, has a pseudogene called PsiBRCA1 lying upstream of BRCA1. In two families with breast and ovarian cancer, homologous recombination took place between BRCA1 intron 2 and PsiBRCA1 intron 2, resulting in a 37-kb deletion that deprives BRCA1 of its promoter and initiation codon and renders it non-functional. The DNA recombination between BRCA1 and its pseudogene represents a new mechanism for oncosuppressor gene inactivation in cancer.35
Recently, a pioneering study conducted by Susanna L Cooke and her team14 identified 42 somatically acquired pseudogenes in 14 out of 629 primary cancer samples and 3 out of 31 cancer cell lines via bioinformatic analysis followed by PCR validation. Among these pseudogenes, 16 were subsequently analysed to investigate the effect of their insertion sites on their expression. Of these, none of the 10 pseudogenes inserted into intergenic regions was expressed, one of the three pseudogenes in introns was expressed, and both pseudogenes in 3′ UTRs were expressed. This result indicates that pseudogenes inserted in introns or 3′ UTRs can harness the transcriptional machinery of the host gene to be expressed, whereas pseudogenes in intergenic regions are far less likely to be expressed because they lack host genes and thus usable transcriptional machinery.
KLK31P, an unprocessed pseudogene of KLK3, has five exons. Exons 3 and 4 are duplicate copies of KLK1 exon 2, while the other three are ‘exonised’ de novo from interspersed repeats. KLK3 and KLK31P are both regulated by androgen. Interestingly, unlike KLK3, whose protein level is increased in the serum of patients with prostate cancer, the expression level of KLK31P decreases stepwise from normal prostate epithelial cells to localised primary cancer cells to metastatic cells,36 indicating that KLK31P may function independently of its parental gene in prostate cancer.
Apart from the aforementioned functions, a newly discovered function of pseudogene DNA in cancer is insertion into the promoter or exons of a host gene, which abolishes expression of the latter (figure 2E). Here, we call this process ‘insertional mutation’. In the lung adenocarcinoma cell line NCI-H2009, a pseudogene derived from PTPN12 was reported to insert into exon 1 of MGA, a possible oncosuppressor gene encoding a MAX-interacting protein. This insertion deletes the promoter and exon 1 regions of MGA and renders it unexpressed.15 Given that tumours evolve through the accumulation of mutations, insertional inactivation of oncosuppressor genes by pseudogenes may represent a new layer of genetic mutation during tumorigenesis.
Functions of pseudogene RNA
Though only a minor fraction of pseudogenes are transcribed, pseudogene transcripts play diverse roles in post-transcriptional regulation (figure 3A–F). Apart from their traditional roles as antisense RNA or endogenous small interfering RNA (endo-siRNA or esiRNA), they can also function as endogenous competitors for miRNA, for RNA-binding proteins (RBPs) or for translational machinery. In some cases, chimeric RNAs can form between pseudogenes and genes, with their functions still to be clarified.
As antisense RNA
Pseudogene RNA that is transcribed in antisense can hybridise directly with the parental sense mRNA to inhibit its translation (figure 3A). For example, neural nitric oxide synthase (nNOS) mRNA hybridises with an antisense nNOS pseudogene transcript, forming a double-stranded RNA–RNA duplex and suppressing nNOS translation.37 However, to date, there is still no validated example of pseudogene RNAs functioning in this manner in cancer.
Antisense pseudogene RNAs can also give rise to endo-siRNAs, as discussed below.
Some pseudogene transcripts can be processed into endo-siRNAs (figure 3B–D). There are two major sources of pseudogene-derived endo-siRNAs. One is from hybrid double-stranded RNAs (dsRNAs) composed of sense and antisense RNAs involving pseudogene (figure 3B, C), and the other from the inverted repeat region of pseudogene that is transcribed into hairpin-shaped RNA (figure 3D). These hairpin-shaped RNAs or hybrid dsRNAs can be sliced by Dicer (a ribonuclease protein) into smaller fragments known as endo-siRNAs that are subsequently separated into single strands and incorporated into the RNA-induced silencing complex, degrading target mRNAs, a process called RNA interference.38
For example, in mouse oocytes, a pseudogene called Au76 is transcribed into long-hairpin RNA and diced into siRNA, regulating expression of its parental gene (Rangap1).39 Also in mice, a pseudogene of Hdac1 can be transcribed in both sense and antisense; the transcripts anneal to each other to form dsRNA that is sliced into siRNAs, regulating the expression of its parental gene.40
In hepatocellular carcinoma, pseudogene-derived endo-siRNAs have been reported.41 ψPPM1K is a partially retrotransposed pseudogene with inverted repeats that are transcribed into long-hairpin RNA and processed into two endo-siRNAs. These target and inhibit the expression of the parental gene (PPM1K) and another gene (NEK8), leading to altered mitochondrial activation and decreased cancer cell proliferation, respectively, which suggests an oncosuppressive role that is both parental-gene-dependent and parental-gene-independent.
As competing endogenous RNA
In recent years, a newly discovered RNA regulatory mechanism involving competing endogenous RNAs (ceRNAs) has become a hotspot in cancer research.3,42–46 ceRNAs are RNAs that share miRNA response elements (MREs) and, therefore, regulate each other's expression by competing for the same pool of miRNAs (figure 3E). Theoretically, any RNA that contains MREs, including pseudogene RNAs, can serve as a ceRNA, or RNA sponge. Because pseudogene RNAs harbour many of the same MREs as their parental RNAs, they are perfect candidates to form ceRNA pairs with their parental RNAs. Additionally, pseudogene RNAs acting as ceRNAs to regulate the expression of genes other than parental genes have also been reported.
For example, the pseudogene OCT4-pg4 transcript was reported to function as a ceRNA that regulates the expression of its parental gene OCT4 by competing for miR-145 in liver cancer. Moreover, the expression level of OCT4-pg4 is significantly associated with patients’ prognosis. Subsequent experiments suggested an oncogenic role for OCT4-pg4 in the HepG2 cell line.46
PTENP1, a pseudogene of the well-known tumour suppressor gene phosphatase and tensin homolog (PTEN), was found to act as a ceRNA in both parental-gene-dependent and parental-gene-independent manners. PTENP1 was found to increase cellular levels of PTEN mRNA in prostate cancer by competitively binding the miR-17, miR-19, miR-21, miR-26 and miR-214 families, freeing PTEN mRNA from miRNA-induced suppression.3 Intriguingly, however, PTENP1 also showed an oncosuppressive role in PTEN-knockout cancer cells, suggesting that this role is at least partially parental-gene-independent. A subsequent study revealed that PTENP1 knockdown leads to reduced levels of p21 in cancer cells.3 Considering that p21 is a target of the miR-17 family and that PTENP1 sequesters the miR-17 family, it is reasonable to infer that PTENP1 sequesters miR-17 and thereby reverses miR-17-mediated p21 suppression.
Given the ubiquity of ceRNA networks in post-transcriptional regulation3,42–46 and the prevalence of pseudogenes in the human genome,2 it is sensible to expect that an increasing number of pseudogene-involved ceRNA networks will be identified in cancers.
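The idea that two transcripts form a candidate ceRNA pair when they share MREs can be illustrated with a minimal Python sketch. Everything here is a simplifying assumption rather than a method from the cited studies: an MRE is reduced to a perfect match to the reverse complement of miRNA nucleotides 2–8 (the seed), the function names are hypothetical, and any sequences passed in are the caller's own. Real target prediction also weighs site context, conservation and expression levels.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def to_rna(seq):
    """Upper-case a sequence and convert a DNA alphabet to RNA."""
    return seq.upper().replace("T", "U")


def seed_site(mirna):
    """Target-site sequence complementary to miRNA nucleotides 2-8 (the seed)."""
    seed = to_rna(mirna)[1:8]
    return "".join(COMPLEMENT[base] for base in reversed(seed))


def shared_mirnas(rna_a, rna_b, mirnas):
    """Names of miRNAs whose seed site occurs in both transcripts."""
    a, b = to_rna(rna_a), to_rna(rna_b)
    shared = []
    for name, seq in mirnas.items():
        site = seed_site(seq)
        if site in a and site in b:
            shared.append(name)
    return sorted(shared)
```

A pseudogene transcript and a parental mRNA that return a non-empty list from `shared_mirnas` would be candidates for the kind of competition described for PTENP1 and PTEN, subject to experimental validation.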
As competitors for RBP or translational machinery
Due to the high similarity in sequence, pseudogene RNAs can also compete with their parental counterparts for RBPs or translational machinery, and thus exert a regulatory role on the latter (figure 3F).
The effects of competition between pseudogenic and parental RNA for RBPs depend on the function of the RBP. For an RNA-stabilising RBP, competition would reduce parental RNA levels. Conversely, for an RNA-degrading RBP, it would upregulate parental RNA (figure 3F). For example, MYLKP1, a pseudogene of the myosin light chain kinase (MYLK) gene, which encodes non-muscle and smooth muscle myosin light chain kinase (smMLCK) isoforms, inhibits parental RNA expression and thus promotes cancer cell proliferation. Subsequent mechanistic research revealed that coexpression of MYLKP1 with smMLCK leads to decreased mRNA stability of smMLCK, suggesting that this pseudogenic and parental RNA pair may compete for RNA-stabilising RBPs.47
Competition for translational machinery results in decreased translation of parental RNAs (figure 3F). ψCx43, for example, is a pseudogene of connexin43 (Cx43), which encodes a protein involved in intercellular communication and tumour pathogenesis. In breast cancer, ψCx43 inhibits Cx43 translation because the former binds to the translation machinery more efficiently than the latter. Knockdown of ψCx43 leads to increased levels of Cx43 mRNA and protein and thus increased cellular sensitivity to chemotherapeutics.48
As chimeric RNAs
Various chimeric RNAs have been identified in multiple cancers recently, with some of them expressed cancer-specifically.49–51 Herein, chimeric RNA refers to an RNA sequence that is transcribed partially from pseudogenes and partially from other genes, but is fused together as a whole (figure 2C).
For example, a chimeric RNA transcript composed of the first two exons of KLK4 and the last two exons of the pseudogene KLKP1 has been identified in prostate cancer.15,52 This chimeric RNA was highly expressed in 30%–50% of prostate cancer tissues, with barely any expression in benign prostate or other tissues, suggesting a cancer type-specific and tissue-specific expression pattern. However, whether this chimeric RNA can be translated into protein, and how it functions in prostate cancer, remains unclear.
Functions of pseudogenic protein
By definition, pseudogenes are gene loci harbouring premature stop codons, indels or frameshift mutations that abrogate their translation.3 In reality, however, though the majority of pseudogenes have lost protein-coding ability, a small number of processed pseudogenes retain or regain this ability. The first pseudogenic protein, PGAM3, was discovered in 2002; it is encoded by a processed pseudogene in primate white blood cells.53 In 2004, a pseudogenic protein was identified in breast cancer cell lines, encoded by the aforementioned pseudogene ψCx43.48 ψCx43 is translated into a protein that is highly homologous to Cx43 protein and exhibits growth-suppressive behaviour similar to that of Cx43 protein.
NANOGP8, one of the 11 pseudogenes of NANOG, a gene that plays key roles in embryonic stem cell self-renewal, encodes a protein detected by anti-Nanog antibody in OS732 cells (a human osteosarcoma cell line) and HepG2 cells (a human liver cancer cell line).54 Later, in prostate cancer,55 the amino acid sequence of the NANOGP8-encoded protein was identified and proven to be nearly identical to that of NANOG protein. In this study, NANOGP8 was found to be the major source of NANOG RNAs, the abundance of which was correlated with the number of CD44-positive cancer stem cells. Accordingly, RNA interference-mediated NANOG knockdown inhibited tumour development, both in vitro and in vivo.55 Though the mRNAs and proteins derived from NANOGP8 and NANOG are almost identical, their expression patterns are somewhat different. For example, both NANOGP8 and NANOG were transcribed in HepG2, whereas only NANOGP8 was transcribed in OS732.54 In another study,56 NANOG was found to be dominantly expressed in the SW620 colon cancer cell line, while NANOGP8 was the major form in two other colon cancer cell lines, HT29 and HCT116. These studies suggest that certain pseudogenes can encode proteins with almost identical functions to those of their parental counterparts, but are expressed in different patterns (figure 4A).
However, not all pseudogenic proteins are fully functional. BRAF is a serine/threonine kinase that is involved in the mitogen-activated protein kinase (MAPK) signalling cascade and is mutated in multiple human cancers. Its pseudogene, BRAFP1, located on chromosome Xq13, has many stop codons that prevent its translation into a fully functional protein. However, the longest open reading frame of BRAFP1 can encode a 244-amino-acid polypeptide, which has high sequence homology with the CR1 domain of wild-type BRAF protein and interacts with the latter, thus activating the MAPK pathway and exerting an oncogenic role in thyroid tumours. Intriguingly, BRAFP1 RNAs were more frequently detected in samples without BRAF mutation, indicating that either of these two events is sufficient to drive tumorigenesis.57 In this case, though the pseudogenic protein is not fully functional, it can influence the activity of the parental protein and thus play a role in tumorigenesis (figure 4B).
Certain pseudogenic proteins or short peptides derived from open-reading frames, on the other hand, are recognised by the human immune system as ‘antigens’ (figure 4C). Examples of this kind have been reported in melanoma58 and in sarcoma.59 Though theoretically ‘self’, cancer cells can produce proteins that are spatiotemporally inappropriate, thus being recognised by the immune system as ‘non-self’. Research on pseudogenic antigens is still in its infancy, but holds promise to give rise to new tumour markers or therapeutic targets.
Conclusion and perspective
The number of pseudogenes in the human genome is reported to exceed half the number of protein-coding genes.2 The prevalence of pseudogenes suggests that they may play a vital role in basic physiology and disease progression. Traditionally, pseudogenes were viewed as ‘junk DNA’ or ‘genomic fossils’ because they are either not transcribed or not translated into functional proteins. However, studies in recent decades indicate that they are far more than merely ‘junk’ or ‘non-functional’. In fact, they play a plethora of roles at multiple levels (DNA, RNA and protein) both in health and in disease, especially in cancer. As stated above, pseudogenes represent a reservoir for gene evolution and/or protein diversity. Pseudogenes can interact with parental genes or other gene loci, altering their sequences and/or transcriptional activities (figure 2A–E). Upon being transcribed into RNA, pseudogenes take on a diversity of post-transcriptional regulatory roles in cancer, acting as antisense RNA, endo-siRNA, ceRNA, chimeric RNA and competitors for RBPs and/or translational machinery (figure 3A–F). Moreover, the discovery of a small number of pseudogenes capable of encoding proteins54–56 that mirror, mimic or interfere with the functions of their parental counterparts (figure 4A–C) has blurred the distinction between genes and pseudogenes. Therefore, in the term ‘pseudogene’, ‘pseudo’ implies sequence variance compared with the parental gene, not pseudofunction. Though sequence mutations render them ‘pseudo-’ relative to parental genes, many of them perform real and indispensable functions in physiological and pathological processes.
A main obstacle in pseudogene research is the close homology of DNA, RNA and/or protein sequences between pseudogene and parental gene, which makes it hard to distinguish the former from the latter. In the past, pseudogenes were considered a nuisance, ‘noise’ that would interfere with the detection of their parental genes. As a result, various endeavours were made to rule out rather than identify pseudogenes. With the advent of the next-generation sequencing era, massive multi-omic data are available in online databases, such as TCGA, Gene Expression Omnibus, International Cancer Genome Consortium, ENCODE and so on, which greatly facilitate research on the cancer genome, epigenome and proteome. Against this background, many pseudogenes have been newly identified with the aid of bioinformatics analysis, especially at the RNA level.14–16
Reclassification of cancers within and across tissues of origin based on multi-omic data is a hotspot in cancer research nowadays. Pioneering studies have already reclassified several common cancers, providing independent prognostic power and potential therapy-guiding value.60,61 Recently, pseudogenes have also been used as molecular signatures for cancer subtyping. In a study16 analysing pseudogene expression patterns in seven cancer types, many pseudogenes were found to be expressed differentially among tumour subtypes. Pseudogene-based tumour subtypes show strong concordance with subtypes based on other omic data, including mRNA expression, miRNA expression, DNA methylation and somatic copy number variation. In kidney cancer, pseudogenes even show stronger prognostic power than other molecular signatures. Therefore, it is reasonable to predict that, in the future, pseudogene signatures, along with other molecular signatures, will pave the way for cancer reclassification and tailored therapy.
LX and GAM contributed equally and are co-first authors.
Contributors JL-J and XJ conceived the idea and collected literature; LX-J and GA-M read through and analysed relevant literature; LX-J and GA-M wrote the manuscript while JL-J and XJ revised it.
Funding Shanghai Science and Technology Committee Yangfan Program (14YF1411900).
Competing interests None.
Provenance and peer review Commissioned; externally peer reviewed.