Almost certainly, one day, you’ll find yourself holding a list of outdated gene symbols. You’ll probably think that updating them is a straightforward task, but it’s not that simple! Because there’s the word ‘bio’ in bioinformatician, updating gene symbols reminds me of the futile cycle. According to Wikipedia’s definition, a futile cycle occurs when two metabolic pathways run simultaneously in opposite directions and have no overall effect other than to dissipate energy in the form of heat**. Updating gene symbols sometimes makes you feel like you’re dissipating a lot of energy for little overall effect. But it’s useful and necessary.

Updating the gene names themselves is not difficult. I can think of several ways to do that nowadays, and several online tools will help you.

In a perfect world, every symbol would be unique, and a gene would only ever be renamed to a symbol that had never been used before. In practice, this is not the case, and people usually work with the symbols and nothing else.

Suppose you have to update the gene symbols of a dataset that contains the PKD2 gene. Because it’s a human gene, you can use the HUGO Gene Nomenclature Committee (HGNC) online tool, which will give you this:

Input  Match type       Approved symbol  Approved name                                     HGNC ID     Location
PKD2   Approved symbol  PKD2             polycystic kidney disease 2 (autosomal dominant)  HGNC:9009   4q22.1
PKD2   Synonyms         PRKD2            protein kinase D2                                 HGNC:17293  19q13.2

The same results can be obtained from the Entrez Gene file gene_info.gz:

> zcat gene_info.gz | awk -F'\t' '$1 == "9606"' | cut -f1,2,3,5,8,9 | grep -e '[^A-Za-z0-9]PKD2[^A-Za-z0-9]'

tax_id  GeneID  Symbol  Synonyms                   location  description
9606    5311    PKD2    APKD2|PC2|PKD4|Pc-2|TRPP2  4q22.1    polycystic kidney disease 2 (autosomal dominant)
9606    25865   PRKD2   PKD2|nPKC-D2               19q13.3   protein kinase D2

PKD2 is both an approved symbol AND a synonym (or alias) of another gene. If you don’t have any other information, how do you know whether your gene is polycystic kidney disease 2 (autosomal dominant) or protein kinase D2? You need to know!
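To make the collision concrete, here is a minimal Python sketch that flags names shared by more than one gene. It assumes you have already extracted the GeneID, Symbol and Synonyms columns of gene_info.gz; the `ROWS` sample, `build_index` and `classify` are hypothetical names used for illustration only, not part of any existing tool.

```python
from collections import defaultdict

# Two rows copied from the gene_info excerpt above: (GeneID, Symbol, Synonyms).
ROWS = [
    ("5311", "PKD2", "APKD2|PC2|PKD4|Pc-2|TRPP2"),
    ("25865", "PRKD2", "PKD2|nPKC-D2"),
]


def build_index(rows):
    """Map every name to the set of (GeneID, role) pairs that use it."""
    index = defaultdict(set)
    for gene_id, symbol, synonyms in rows:
        index[symbol].add((gene_id, "approved"))
        for alias in synonyms.split("|"):
            if alias != "-":  # gene_info writes "-" when a field is empty
                index[alias].add((gene_id, "synonym"))
    return index


def classify(name, index):
    """Return 'unknown', 'ambiguous', 'approved' or 'synonym' for a name."""
    hits = index.get(name, set())
    if not hits:
        return "unknown"
    gene_ids = {gene_id for gene_id, _ in hits}
    if len(gene_ids) > 1:
        return "ambiguous"  # the same name points to several genes
    roles = {role for _, role in hits}
    return "approved" if "approved" in roles else "synonym"


index = build_index(ROWS)
for name in ("PKD2", "PRKD2", "TRPP2"):
    print(name, classify(name, index))
# PKD2 ambiguous
# PRKD2 approved
# TRPP2 synonym
```

Anything reported as ambiguous cannot be updated from the symbol alone and needs extra information, such as an identifier or a chromosomal location.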

In fact, because no single naming convention is consistently applied to genes, gene symbol disambiguation has become an area of research. Supervised learning, thesaurus-based methods and sets of rules are some examples of the different approaches considered to deal with this challenge.

Nonetheless, it’s always better to work in parallel with unique identifiers such as the Entrez Gene ID or the Ensembl ID. These can be used as keys to retrieve the symbols. You could also work with chromosomal positions. Sooner or later, even when working with identifiers or chromosomal locations, you’ll need a mapping tool (between IDs or between genome revisions). But even though identifiers may change or be withdrawn and chromosomal locations may be revised, those features will always be less ambiguous than gene symbols!
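As a rough illustration of using an identifier as the key, the Python sketch below refreshes symbols by Entrez Gene ID instead of by name. The `CURRENT_SYMBOL` lookup (filled with the two genes from the excerpt above), the record layout and `refresh_symbols` are assumptions made for this example; in practice the lookup would be built from a current gene_info.gz.

```python
# Hypothetical lookup: Entrez Gene ID -> current approved symbol.
CURRENT_SYMBOL = {"5311": "PKD2", "25865": "PRKD2"}


def refresh_symbols(records, current_symbol):
    """Update each record's symbol from its gene ID.

    records: list of dicts with 'gene_id' and 'symbol' keys (assumed layout).
    Returns (updated, unresolved); unresolved holds records whose ID is
    missing from the lookup, for example because it was withdrawn.
    """
    updated, unresolved = [], []
    for record in records:
        symbol = current_symbol.get(record["gene_id"])
        if symbol is None:
            unresolved.append(record)  # needs an ID mapping tool or manual review
        else:
            updated.append({**record, "symbol": symbol})
    return updated, unresolved


dataset = [
    {"gene_id": "25865", "symbol": "PKD2"},  # outdated alias of PRKD2
    {"gene_id": "5311", "symbol": "PKD2"},   # already the approved symbol
]
updated, unresolved = refresh_symbols(dataset, CURRENT_SYMBOL)
print([r["symbol"] for r in updated])  # ['PRKD2', 'PKD2']
```

Both records start out labelled PKD2, yet the ID sends each one to the right gene, which is exactly what the symbol alone cannot do.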


**ReferenceMD has a more technical definition. Here’s an excerpt: “A set of opposing, nonequilibrium reactions catalyzed by different enzymes which act simultaneously, with at least one of the reactions driven by ATP hydrolysis. The results of the cycle are that ATP energy is depleted, heat is produced and no net substrate-to-product conversion is achieved.”