Pivoting tables: from long to wide

As bioinformaticians, we often have to work with data that are not formatted the way we need them to be. One case we might encounter is receiving data in a "long" format instead of the more familiar "wide" format. Those of you familiar with the ggplot R package know the long format very well: it's the format ggplot requires to produce its nice graphs. Long format (columns: genes, samples, expression), first row: BAD, S01, 7.525395 [...]
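To make the long/wide distinction concrete, here is a minimal sketch in plain Python. It is not the code from the post itself: the `long_to_wide` helper is a hypothetical name, and every row except the first (BAD, S01, 7.525395) is a made-up value for illustration.

```python
from collections import defaultdict


def long_to_wide(rows):
    """Pivot (gene, sample, expression) triples into {gene: {sample: expression}}."""
    wide = defaultdict(dict)
    for gene, sample, expression in rows:
        wide[gene][sample] = expression
    return dict(wide)


# Long format: one row per (gene, sample) pair.
# Only the first row comes from the excerpt; the others are hypothetical.
long_rows = [
    ("BAD", "S01", 7.525395),
    ("BAD", "S02", 7.1),   # hypothetical value
    ("TP53", "S01", 9.3),  # hypothetical gene and value
    ("TP53", "S02", 8.8),  # hypothetical gene and value
]

wide = long_to_wide(long_rows)
samples = sorted({sample for _, sample, _ in long_rows})

# Wide format: one row per gene, one column per sample.
print("gene", *samples, sep="\t")
for gene, values in wide.items():
    print(gene, *(values.get(s, "NA") for s in samples), sep="\t")
```

In the wide table each gene appears once and each sample becomes its own column; a missing (gene, sample) pair would show up as NA.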

November 14, 2016 (updated 2017-04-29) | Categories: Data Analysis, Python, R | 0 Comments

Implementing a “Siamese” Neural Network with Mariana 1.0

Mariana was previously introduced on this blog by Geneviève in her May post, Machine learning in life science. The Mariana codebase on GitHub is currently at its third release candidate, ahead of the stable 1.0 release. This new version incorporates a large refactoring effort as well as many new features (the changelog gives a complete list of the changes in version 1.0). I am taking this opportunity to present here a small tutorial on extending the [...]

November 7, 2016 (updated 2017-04-29) | Categories: Machine learning, Python | 0 Comments

Fast network transfers?

Recently, everyone and their mother has started using various tools to optimize large data transfers to, from and between supercomputers. Historically, we have seen tools like FDT and BBCP try to exceed the performance of other transfer methods such as scp, rsync and ftp. One tool in particular is now gaining traction and is being deployed on most supercomputers: GridFTP and its front end, Globus. Before jumping on the bandwagon, I thought it would [...]

October 13, 2016 (updated 2017-04-29) | Categories: Computer science, Performance | 0 Comments

Bootstraps and Confidence Intervals

When analyzing data, you might want or need to fit a specific curve to a particular dataset. This type of analysis can yield instructive results about the relationship between two (or more...) quantifiable parameters. The main subject of this post is not how to implement such a fit, but rather how to display its goodness, i.e., how to calculate a confidence interval around a fitted curve. That being said, I will show how to do curve fitting in [...]
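As a rough illustration of the idea, here is a minimal Python sketch of a percentile bootstrap band around a fitted curve. It rests on assumptions that are not in the post: the curve is a straight line fitted by ordinary least squares, the data points are resampled in pairs, and the data are synthetic. The helper names `fit_line` and `bootstrap_band` are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def bootstrap_band(xs, ys, grid, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence band for the fitted line at each grid point."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = [[] for _ in grid]
    for _ in range(n_boot):
        # Resample (x, y) pairs with replacement and refit the line.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: a line cannot be fitted
        slope, intercept = fit_line(bx, by)
        for k, g in enumerate(grid):
            predictions[k].append(slope * g + intercept)
    band = []
    for values in predictions:
        values.sort()
        lower = values[int((alpha / 2) * (len(values) - 1))]
        upper = values[int((1 - alpha / 2) * (len(values) - 1))]
        band.append((lower, upper))
    return band


# Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise.
noise = random.Random(42)
xs = [float(x) for x in range(10)]
ys = [2 * x + 1 + noise.gauss(0, 1) for x in xs]

grid = [0.0, 4.5, 9.0]
band = bootstrap_band(xs, ys, grid)
for g, (lower, upper) in zip(grid, band):
    print(f"x = {g}: 95% band [{lower:.2f}, {upper:.2f}]")
```

Plotting the lower and upper values across a finer grid gives the shaded band usually drawn around the fitted curve.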

September 29, 2016 (updated 2017-04-29) | Categories: Data Analysis, R, Statistics | 1 Comment

Simple multiprocessing in R (2nd edition)

The last time I spoke about this subject, I presented a really simple way to change an lapply call into its multicore sibling, mclapply. Now, while this is an extremely easy modification that can bring substantial performance benefits, it kinda required you to already be using the lapply function in the first place. So let's look at another way to introduce multiprocessing into your existing codebase, using the foreach and doMC packages. [...]

September 19, 2016 (updated 2017-04-29) | Categories: Performance, R | 0 Comments