Almost certainly, one day, you’ll have a list of outdated gene symbols in your hands. You’ll probably think that updating them is a straightforward task, but it’s not that simple! Because there’s the word ‘bio’ in bioinformatician, updating gene symbols reminds me of the futile cycle. According to Wikipedia’s definition, a futile cycle occurs when two metabolic pathways run simultaneously in opposite directions and have no overall effect other than to dissipate energy in the form of heat**. Updating gene symbols sometimes makes you feel like you’re dissipating a lot of energy for little overall effect. But it’s useful and necessary.
Updating the gene names themselves is not difficult. I can think of several ways nowadays to do that. Several online tools will help you.
In a perfect world, all symbols would be unique and they would be updated to a symbol never used before. In practice, this is not the case. And people usually work with the symbols and with the symbols only.
Suppose you have to update the gene symbols of a dataset that contains the PKD2 gene. Because it’s a human gene, you can use the HUGO online tool, which will give you this:
Input | Match type | Approved symbol | Approved name | HGNC ID | Location |
---|---|---|---|---|---|
PKD2 | Approved symbol | PKD2 | polycystic kidney disease 2 (autosomal dominant) | HGNC:9009 | 4q22.1 |
PKD2 | Synonyms | PRKD2 | protein kinase D2 | HGNC:17293 | 19q13.2 |
The same results can be achieved using the Entrez Gene file gene_info.gz:

> zcat gene_info.gz | grep 9606 | cut -f1,2,3,5,8,9 |grep -e '

tax_id | GeneID | Symbol | Synonyms | location | description |
---|---|---|---|---|---|
9606 | 5311 | PKD2 | APKD2\|PC2\|PKD4\|Pc-2\|TRPP2 | 4q22.1 | polycystic kidney disease 2 (autosomal dominant) |
9606 | 25865 | PRKD2 | PKD2\|nPKC-D2 | 19q13.3 | protein kinase D2 |

PKD2 is an approved symbol AND is found in the synonyms (or aliases) of another gene. If you don’t have any other information, how do you know whether your gene is polycystic kidney disease 2 (autosomal dominant) or protein kinase D2? You need to know! In fact, since there is no naming convention for genes, gene symbol disambiguation has become an area of research. Supervised learning, thesaurus-based methods and sets of rules are some examples of the approaches considered to deal with this challenge. Nonetheless, it’s always better to work in parallel with unique identifiers such as the Entrez Gene ID or Ensembl ID. These can be used as keys to retrieve the symbols. You could also work with the chromosomal positions. Sooner or later, even when working with identifiers or chromosomal locations, you’ll need a mapping tool (between IDs or between genome revisions). But even if identifiers may change or be withdrawn and chromosomal locations may be revised, those features will always be less ambiguous than the gene symbols!
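To make the PKD2 ambiguity concrete, here is a minimal Python sketch. It assumes you have already reduced gene_info to the six tab-separated columns shown above (tax_id, GeneID, Symbol, Synonyms, location, description). The function name `find_candidates` and the inlined `SAMPLE` rows are mine, for illustration only; the two rows are copied from the excerpt above. The sketch reports every gene that matches a query, either as its official symbol or as one of its synonyms, and returns the Gene ID so you can keep working with the identifier instead of the symbol.

```python
import csv
import io

# Two rows copied from the gene_info excerpt, limited to the six columns
# tax_id, GeneID, Symbol, Synonyms, location, description (tab-separated).
SAMPLE = (
    "9606\t5311\tPKD2\tAPKD2|PC2|PKD4|Pc-2|TRPP2\t4q22.1\t"
    "polycystic kidney disease 2 (autosomal dominant)\n"
    "9606\t25865\tPRKD2\tPKD2|nPKC-D2\t19q13.3\tprotein kinase D2\n"
)


def find_candidates(query, rows):
    """Return (GeneID, official symbol, match type) for each gene matching query."""
    hits = []
    for tax_id, gene_id, symbol, synonyms, location, description in rows:
        if symbol == query:
            hits.append((gene_id, symbol, "symbol"))
        elif query in synonyms.split("|"):
            # Synonyms are pipe-separated in gene_info.
            hits.append((gene_id, symbol, "synonym"))
    return hits


# QUOTE_NONE: treat any quote characters in descriptions as plain text.
rows = list(csv.reader(io.StringIO(SAMPLE), delimiter="\t", quoting=csv.QUOTE_NONE))

candidates = find_candidates("PKD2", rows)
for gene_id, symbol, match_type in candidates:
    print(gene_id, symbol, match_type)
if len(candidates) > 1:
    print("ambiguous: PKD2 maps to", len(candidates), "genes")
```

With more than one candidate, the symbol alone cannot tell you which gene you have; only extra information (a Gene ID, an Ensembl ID, a chromosomal location) can settle it.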
**ReferenceMD has a more technical definition. Here’s an excerpt: “A set of opposing, nonequilibrium reactions catalyzed by different enzymes which act simultaneously, with at least one of the reactions driven by ATP hydrolysis. The results of the cycle are that ATP energy is depleted, heat is produced and no net substrate-to-product conversion is achieved.”