Simple multiprocessing in R

Continuing my effort to help you get the most out of your CPUs, I figured we could look into some multiprocessing functionality available for your R scripts. While there are a few different options for running multi-core computations on your data, we'll focus on something really simple to put in place.

A while back, I was putting together a script to run a large series of logistic regressions (using the glm function) in an attempt to model some data. This was fairly time consuming, since a great number of these regressions had to be computed (and optimized!). Ultimately, though, all of these calculation runs were independent of one another, so obviously I decided to look for a way to parallelize the execution.

The solution I found (which fit my code structure) was the simple replacement of the lapply function I was using with the mclapply implementation from the parallel package (which has been part of the R distribution since version 2.14). That one function-call replacement cut my calculation time roughly by a factor of four!
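Before getting to the real code, here is a minimal toy sketch of the kind of speedup involved (slow_square and the 0.5 s sleep are made up for illustration). On a machine with at least 4 cores, the parallel call should finish in roughly a quarter of the time. One caveat worth knowing: mclapply relies on process forking, so on Windows it only runs with mc.cores = 1.

```r
library(parallel)

# A deliberately slow toy function standing in for one expensive regression run
slow_square <- function(x) {
  Sys.sleep(0.5)
  x^2
}

system.time(res_serial   <- lapply(1:8, slow_square))                  # ~4 s
system.time(res_parallel <- mclapply(1:8, slow_square, mc.cores = 4))  # ~1 s

identical(res_serial, res_parallel)  # TRUE: same results, less wall time
```

mclapply returns its results in the same order as lapply, so the two result lists are interchangeable.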

And when I say the replacement was simple, here’s what I mean:
Original piece of code:

...

gene_scores <- do.call('rbind', lapply(genes, function(x, data, formula) {
    yvar <- all.vars(formula)[1]          # name of the response variable
    work <- data[, c(yvar, x)]            # keep only the response and the current gene
    model <- glm(formula, data = work, family = binomial)
    s <- summary(model)
    crossval <- CV_JPL(model, print.details = FALSE)  # custom cross-validation helper
    return(data.frame(gene = x, deviance = s$deviance,
                      acc.cv = crossval$acc.cv,
                      acc.internal = crossval$acc.internal))
}, data = training, formula = formula))

...

Multicore version:

library(parallel) # Need to load the library!

...

gene_scores <- do.call('rbind', mclapply(genes, function(x, data, formula) {
    yvar <- all.vars(formula)[1]          # name of the response variable
    work <- data[, c(yvar, x)]            # keep only the response and the current gene
    model <- glm(formula, data = work, family = binomial)
    s <- summary(model)
    crossval <- CV_JPL(model, print.details = FALSE)  # custom cross-validation helper
    return(data.frame(gene = x, deviance = s$deviance,
                      acc.cv = crossval$acc.cv,
                      acc.internal = crossval$acc.internal))
}, data = training, formula = formula, mc.cores = 4))  # run on 4 cores

...

The code changes are easy to spot. They amount to:
1- load the parallel package
2- change the function call from lapply to mclapply
3- specify the number of cores to use with the mc.cores parameter

The main caveat of parallel execution has to do with debugging and testing. Because mclapply forks child processes, the parallelized code is harder to interrupt once running, and errors raised inside the workers are not reported as directly as in serial code. But then again, simple steps, such as testing on a reduced dataset first, can be taken to alleviate this.
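One convenient trick while developing (a sketch; the n_cores variable is just an illustrative convention): drive the core count from a single variable and set it to 1 during debugging. With mc.cores = 1, mclapply runs serially in the main process, where print() output and error messages surface normally.

```r
library(parallel)

# Set to 1 while debugging; raise it once the code is trusted,
# e.g. to detectCores() - 1.
n_cores <- 1

results <- mclapply(1:4, function(x) {
  # With mc.cores = 1 this body runs in the main process,
  # so any error or print() output is immediately visible.
  x * 10
}, mc.cores = n_cores)

unlist(results)  # 10 20 30 40
```

Once everything works on a reduced dataset with n_cores set to 1, switching to the full dataset and the full core count requires no other change.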

In conclusion, this is clearly an easy modification to any code base that makes substantial use of lapply calls, so there's really no reason not to make this optimization part of your code base! If that is not the case, a number of packages provide parallelization for other patterns; the foreach / doMC combination comes to mind. If necessary, take a look at this page for a wealth of extra resources.
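For completeness, here is a minimal sketch of the foreach / doMC approach mentioned above (both are CRAN packages, not part of base R, so they need to be installed first):

```r
library(foreach)
library(doMC)

registerDoMC(cores = 4)  # register a 4-worker parallel backend

# %dopar% runs each iteration in parallel; .combine = rbind stacks
# the per-iteration data frames, much like do.call('rbind', ...) above
squares <- foreach(i = 1:8, .combine = rbind) %dopar% {
  data.frame(i = i, square = i^2)
}

squares$square  # 1 4 9 16 25 36 49 64
```

The nice property of foreach is that the same loop runs serially if no backend is registered, which makes the serial/parallel switch as painless as the lapply/mclapply one.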

Happy coding 🙂

March 14, 2016 | Categories: Performance, R

About the Author:

Originally trained in molecular biology, I quickly realized my heart lay with bioinformatics! (How can anyone be introduced to an HMM and not fall in love?) While I spend most of my days writing Python code, I must admit I am starting to enjoy my occasional dip into R.
