Gene symbols: the challenge

Almost certainly, one day, you’ll find yourself holding a list of outdated gene symbols. You’ll probably think that updating them is a straightforward task, but it’s not that simple! Because there’s the word ‘bio’ in bioinformatician, updating gene symbols reminds me of the futile cycle. According to Wikipedia’s definition, a futile cycle occurs when two metabolic pathways run simultaneously in opposite directions and have no overall effect other than to dissipate energy in the form of heat**. Updating gene symbols sometimes makes you feel like you’re dissipating a lot of energy for little overall effect. But it’s useful and necessary.

The mechanics of updating the gene names are not difficult. There are several ways to do it nowadays, and several online tools will help you.

In a perfect world, all symbols would be unique and they would be updated to a symbol never used before. In practice, this is not the case. And people usually work with the symbols and with the symbols only.

Suppose you have to update the gene symbols of a dataset that contains the PKD2 gene. Because it’s a human gene, you can use the HUGO (HGNC) online tool, which will give you this:

Input	Match type	Approved symbol	Approved name	HGNC ID	Location
PKD2	Approved symbol	PKD2	polycystic kidney disease 2 (autosomal dominant)	HGNC:9009	4q22.1
PKD2	Synonyms	PRKD2	protein kinase D2	HGNC:17293	19q13.2

The same results can be obtained from the Entrez Gene file gene_info.gz:

> zcat gene_info.gz | awk -F'\t' '$1 == 9606' | cut -f1,2,3,5,8,9 | grep -e '[^A-Za-z0-9]PKD2[^A-Za-z0-9]'

tax_id	GeneID	Symbol	Synonyms	location	description
9606	5311	PKD2	APKD2|PC2|PKD4|Pc-2|TRPP2	4q22.1	polycystic kidney disease 2 (autosomal dominant)
9606	25865	PRKD2	PKD2|nPKC-D2	19q13.3	protein kinase D2

PKD2 is an approved symbol AND is found in the synonyms (or aliases) of another gene. If you don’t have any other information, how do you know if your gene is the polycystic kidney disease 2 (autosomal dominant) or the protein kinase D2? You need to know!
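To make the ambiguity explicit, here is a minimal sketch that labels each hit as an approved symbol or a synonym, much like the "Match type" column of the HUGO output. It assumes the six-column layout produced by the cut command above; the helper names sample_rows and classify_symbol are mine, and the two sample rows are simply copied from the gene_info excerpt.

```shell
# Two rows copied from the gene_info excerpt above (tab-separated:
# tax_id, GeneID, Symbol, Synonyms, location, description).
sample_rows() {
  printf '9606\t5311\tPKD2\tAPKD2|PC2|PKD4|Pc-2|TRPP2\t4q22.1\tpolycystic kidney disease 2 (autosomal dominant)\n'
  printf '9606\t25865\tPRKD2\tPKD2|nPKC-D2\t19q13.3\tprotein kinase D2\n'
}

# classify_symbol QUERY (hypothetical helper): for every row where QUERY is the
# official symbol or one of the synonyms, print GeneID, Symbol and the match type.
classify_symbol() {
  awk -F'\t' -v q="$1" '
    # Exact match on the Symbol column
    $3 == q { print $2 "\t" $3 "\tApproved symbol"; next }
    {
      # Otherwise look for an exact match among the pipe-separated synonyms
      n = split($4, syn, "[|]")
      for (i = 1; i <= n; i++)
        if (syn[i] == q) { print $2 "\t" $3 "\tSynonym"; break }
    }'
}

sample_rows | classify_symbol PKD2
```

With these two rows, the query PKD2 returns one "Approved symbol" line (GeneID 5311) and one "Synonym" line (GeneID 25865). Any query that returns more than one line is a symbol you cannot resolve without extra information.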

In fact, since there is no naming convention for genes, gene symbol disambiguation has become an area of research. Supervised learning, thesaurus-based methods and sets of rules are some examples of the different approaches considered to deal with this challenge.

Nonetheless, it’s always better to carry unique identifiers, such as the Entrez Gene ID or the Ensembl ID, alongside the symbols. These can be used as a key to retrieve the symbols. You could also work with the chromosomal positions. Sooner or later, even when working with identifiers or chromosomal locations, you’ll need a mapping tool (between IDs or between genome revisions). But even if identifiers may change or be withdrawn and chromosomal locations may be revised, those features will always be less ambiguous than the gene symbols!
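As a rough illustration of using the identifier as the key, the sketch below refreshes the symbols of a dataset from its Entrez Gene IDs. The reference rows come from the gene_info excerpt earlier in this post; the dataset rows, the NOT_FOUND marker and the helper names (reference, dataset, refresh_symbols) are hypothetical and only there for the example.

```shell
# Reference rows (GeneID, current Symbol) taken from the gene_info excerpt in this post.
reference() {
  printf '5311\tPKD2\n25865\tPRKD2\n'
}

# Hypothetical dataset: GeneID plus whatever symbol was recorded at the time.
dataset() {
  printf '25865\tPKD2\n5311\tPKD2\n'
}

# refresh_symbols REF_FILE (hypothetical helper): read dataset rows on stdin and
# replace column 2 with the symbol currently attached to the GeneID in REF_FILE.
refresh_symbols() {
  awk -F'\t' '
    NR == FNR { current[$1] = $2; next }              # first input: GeneID -> current symbol
    ($1 in current) { print $1 "\t" current[$1]; next }
    { print $1 "\tNOT_FOUND" }                        # ID missing from the reference
  ' "$1" -
}

ref_file="$(mktemp)"
reference > "$ref_file"
dataset | refresh_symbols "$ref_file"
rm -f "$ref_file"
```

Both dataset rows were labelled PKD2, yet keyed by GeneID they come back as PRKD2 and PKD2 respectively. In real use the reference would be built from the full gene_info.gz file (columns 2 and 3), and IDs flagged as NOT_FOUND would need a look at the withdrawn or replaced identifiers.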

**ReferenceMD has a more technical definition. Here’s an excerpt: “A set of opposing, nonequilibrium reactions catalyzed by different enzymes which act simultaneously, with at least one of the reactions driven by ATP hydrolysis. The results of the cycle are that ATP energy is depleted, heat is produced and no net substrate-to-product conversion is achieved.”