The language(s) of bioinformatics

The question I am asked most often about bioinformatics is, unfortunately, also the one that leads to the least productive discussions I’ve taken part in: which programming language should I use for bioinformatics?

Don’t get me wrong, in a pub, over a beer, this can lead to some lively entertainment among the nerd intelligentsia… but rarely does it lead to enlightenment that persists in the morning!

Here, I’d like to share the current answer I have honed over the past years. It is based on developing skills in three languages, selected to fulfill complementary roles and to cover the spectrum of tasks that a bioinformatician can expect to tackle. In each role, the optimal choice of language might change with the specific context of application or with the evolution of the domain.

  1. A high-level language. It must have a rich set of libraries in your domain of application and must allow code to be developed with few constraints. It can compromise on speed and scalability. Here, Python is the most frequent recommendation, and Perl would do too! Other interesting contenders are JavaScript, Julia, Ruby, Clojure… but their small bioinformatics following means that they should be selected with more caution.
  2. An analysis language (or platform). It must enable a vast array of mathematical analyses and provide rich graphing facilities. It must support the documentation, automation and reproducibility of the data analysis steps. Here, R and MATLAB have strong followings; one would be selected over the other mainly based on the field of application (R in genomics and genetics, MATLAB for image analysis and dynamic systems).
  3. A low-level, compiled language. Here, speed of execution and fine-grained control of resource allocation are the main determinants. Mastering this last language is optional: many bioinformaticians won’t encounter problems that require that level of development in their whole career. C/C++ would be the de facto standard today, but arguments could be made in favor of Java, C#, Swift or even Fortran, again depending on the context of application. It is often a very effective strategy to use this lower-level language to build extensions to the language selected for role #1, thus confining low-level code development to the small compute- or memory-intensive parts of a system.
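To make the last point of role #3 a bit more concrete, here is a minimal sketch in Python of keeping the glue code high-level while pushing the hot loop into compiled code. It is only an illustration under a few assumptions: the function names and the example sequence are made up, and instead of a hand-written C extension it leans on `str.count`, which is implemented in C in CPython, as a stand-in for the compiled part.

```python
# Illustrative sketch only: function names and data are hypothetical.

def gc_fraction_pure(seq: str) -> float:
    """GC fraction with the scanning loop written entirely in Python."""
    if not seq:
        return 0.0
    gc = 0
    for base in seq:
        if base in "GCgc":
            gc += 1
    return gc / len(seq)


def gc_fraction_delegated(seq: str) -> float:
    """Same result, but the scanning is done inside str.count.

    In CPython, str.count is implemented in C, so it stands in here for
    a compiled extension that would handle the compute-intensive part.
    """
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(seq)


if __name__ == "__main__":
    example = "ATGCGCTTAGGC"  # made-up sequence
    print(gc_fraction_pure(example), gc_fraction_delegated(example))
```

Both functions present the same interface to the rest of the program, so the calling code does not need to know which one does the work. A real extension written in C/C++ would follow the same pattern, only with the compiled part written by you rather than borrowed from the interpreter.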

Some will object to the significant investment that learning three languages represents, but the alternative is the risk of becoming stuck implementing a certain type of solution in an inappropriate language. I’ve witnessed this happen over and over: a web service in R, a neural network in vanilla Python (pre-NumPy) or a regex-based file parser in C++… Even worse are the lost opportunities that result from deciding not to pursue (or to severely cripple) a given experiment because language limitations make it impractical.

And, to the few among you claiming that three languages are not enough, well… hobbies are a beautiful thing!

2017-04-29T16:59:22+00:00 April 18, 2016 | Categories: Bioinformatics

About the Author:

Enjoys turning raw (big, noisy and convoluted!) biological data into knowledge… using any tools that informatics has to offer!
