Bootstraps and Confidence Intervals

When analyzing data, you might want or need to fit a specific curve to a particular dataset. This type of analysis can yield instructive results about the relationship between two (or more...) quantifiable parameters. The main objective of this post is not how to implement such a fit, but rather how to display its goodness, i.e. how to calculate a confidence interval around a fitted curve. That being said, I will show how to do curve fitting in [...]
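Since the excerpt cuts off before the implementation, here is a minimal sketch of the idea in R. The exponential model, the simulated data, and the `nls()` starting values are invented for illustration and are not taken from the post:

```r
# Bootstrap a pointwise confidence band around a fitted curve.
# Hypothetical model: y = a * exp(b * x) + noise, fitted with nls().
set.seed(42)
x <- seq(0, 5, length.out = 50)
y <- 2 * exp(0.5 * x) + rnorm(50, sd = 1)
d <- data.frame(x = x, y = y)

n_boot <- 200
grid <- seq(0, 5, length.out = 100)
preds <- matrix(NA_real_, nrow = n_boot, ncol = length(grid))

for (i in seq_len(n_boot)) {
  # Resample rows with replacement and refit the curve
  di <- d[sample(nrow(d), replace = TRUE), ]
  fit <- try(nls(y ~ a * exp(b * x), data = di,
                 start = list(a = 1, b = 0.5)), silent = TRUE)
  if (!inherits(fit, "try-error")) {
    preds[i, ] <- predict(fit, newdata = data.frame(x = grid))
  }
}

# 95% pointwise band: 2.5% and 97.5% quantiles of the bootstrap predictions
ci <- apply(preds, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
```

The band can then be drawn with `polygon()` or `ggplot2::geom_ribbon()` around the curve fitted on the full dataset.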

September 29, 2016 | Categories: Data Analysis, Data Visualization, R | 1 Comment

Simple multiprocessing in R (2nd edition)

The last time I wrote about this subject, I presented a really simple way to change an lapply call into its multicore sibling, mclapply. Now while this is an extremely easy modification to make in your code for substantial performance benefits, it kinda requires you to be using the lapply function in the first place. So let's look at another way to introduce multiprocessing into your existing codebase, using the foreach and doMC packages. [...]
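A quick sketch of the two approaches side by side (not the post's own code; `slow_square` is a made-up placeholder, and mclapply forks, so it is Unix-only):

```r
library(parallel)
library(foreach)
library(doMC)

registerDoMC(cores = 2)  # doMC: register the parallel backend for foreach

slow_square <- function(x) { Sys.sleep(0.01); x^2 }

# lapply -> mclapply: a drop-in replacement when you already use lapply
res1 <- mclapply(1:8, slow_square, mc.cores = 2)

# foreach + %dopar%: no lapply needed; results combined here with c()
res2 <- foreach(i = 1:8, .combine = c) %dopar% slow_square(i)
```

mclapply returns a list (like lapply), whereas foreach with `.combine = c` returns the combined vector directly.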

September 19, 2016 | Categories: Performance, R | 0 Comments

Fastest method to compute an AUC

Context: AUC is an acronym for "Area Under the (ROC) Curve". If you are not familiar with the ROC curve and AUC, I suggest reading this blog post before continuing further. For several projects, I needed to compute a large number of AUCs. It started with 25,000, increased to 230,000, and now I need to compute 1,500,000 AUCs. With so many AUCs, the time to compute each one becomes critical. On the web, I didn't find much information about this specific [...]
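One fast approach, sketched here in R, is the rank-based (Mann-Whitney U) formula, which computes the AUC directly from the ranks of the scores without building the ROC curve. The post may benchmark a different method, so treat this as one candidate rather than its conclusion:

```r
# AUC via the Mann-Whitney U statistic: a single rank() call, O(n log n).
fast_auc <- function(scores, labels) {
  # scores: predicted values; labels: 0/1 class indicators
  r <- rank(scores)                 # average ranks handle ties correctly
  n_pos <- sum(labels == 1)
  n_neg <- length(labels) - n_pos
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Perfect separation gives AUC = 1
fast_auc(c(0.1, 0.2, 0.8, 0.9), c(0, 0, 1, 1))  # 1
```

Because the whole computation is vectorized, this scales well to millions of calls, especially compared with approaches that construct the full ROC curve each time.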

August 18, 2016 | Categories: Data Analysis, Performance, Python, R, Statistics | 0 Comments

Speed up random disk access

When working with software that accesses data from disk in a random fashion, it is common knowledge that the best performance will be reached using SSD drives, with SAS disks being less efficient and SATA disks being the worst. However, high-capacity SSD drives are still relatively expensive, so when working with large datasets, one typically ends up with data stored on larger, more common SATA drives. I recently experimented with the Jellyfish software to analyze [...]

August 4, 2016 | Categories: Performance | 0 Comments

Bioinformatics in a container

A recent trend coming from the world of cloud computing is gaining more and more popularity in the bioinformatics community: developing and deploying applications in containers. A container holds not only the application but also all the required libraries and a minimal version of the operating system. As soon as it is built, the container is ready for use on any host computer providing the environment required to start it. For a [...]
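As a small illustration of the idea (a sketch, not the post's own example), a container recipe can be as short as a Dockerfile like this, using samtools purely as a stand-in bioinformatics tool:

```dockerfile
# Hypothetical minimal image bundling a bioinformatics tool with its
# libraries on top of a small base operating system.
FROM ubuntu:16.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends samtools \
    && rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["samtools"]
```

Building with `docker build` and running with `docker run` then gives the same samtools environment on any host with a container runtime installed.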

July 21, 2016 | Categories: Bioinformatics | 0 Comments