Simple multiprocessing in R (2nd edition)

The last time I spoke about this subject, I presented a really simple way to change an lapply call into its multicore sibling, mclapply. While this is an extremely easy change that can bring substantial performance benefits, it requires you to be using lapply in the first place. So let’s look at another way to introduce multiprocessing into your existing codebase, using the foreach and doMC packages.

The foreach package is key here because it implements a looping construct which does not use a loop counter, thus making it usable in a parallel execution context. As an added benefit to all Python fans out there, foreach just might give you the impression you’re working with list comprehensions 🙂
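To make the list comprehension comparison concrete, here is a minimal sketch. It uses the sequential %do% operator so that no parallel backend is needed; the variable name squares is only illustrative.

```r
library(foreach)

# Roughly the R counterpart of Python's [i ** 2 for i in range(1, 6)]:
# foreach evaluates the body once per value of i and, because of
# .combine = c, concatenates the results into a single vector.
squares <- foreach(i = 1:5, .combine = c) %do% {
  i^2
}

print(squares)
```

Replacing %do% with %dopar% is all it takes to run the same body on a registered parallel backend.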

OK, let’s have a look at some code examples using the randomForest package, as that will allow us to touch on a few key aspects of the foreach construct.

As stated in the randomForest reference manual regarding the ntree parameter:

Number of trees to grow. This should not be set to too small a number, to ensure
that every input row gets predicted at least a few times

Of course, raising the value of ntree will result in a higher compute time… But with the help of foreach and doMC, we’ll be able to split the work over multiple CPUs.

This snippet:

library("randomForest")

# ... generate your training set (x and y)

rf <- randomForest(x=x, y=y, ntree=1000)

Becomes:

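library("randomForest")
library(foreach)
library(doMC)

# Register 4 worker processes with doMC
registerDoMC(4)

# ... generate your training set (x and y)

# Grow 4 forests of 250 trees each in parallel, then merge them into a single forest
rf <- foreach(ntree=rep(250, 4), .combine=combine, .packages='randomForest') %dopar% {
    randomForest(x=x, y=y, ntree=ntree)
}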

A few key things are taking place here:

We load the foreach and doMC libraries
We register the number of CPUs available to doMC. Don’t forget this step, or %dopar% will issue a warning and execution will take place sequentially!
We specify that the package randomForest should be loaded in the parallel execution context using the .packages option of foreach
We merge the results of each parallel run by passing the combine function of randomForest to the .combine option of foreach
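The steps above can be checked end to end with a small, self-contained sketch. The iris data set, the 2 workers and the 4 x 50 trees are assumptions chosen only to keep the run short; they are not part of the original example. Note that doMC relies on forking, so this is meant for Linux or macOS.

```r
library(randomForest)
library(foreach)
library(doMC)

# Register 2 workers (an arbitrary choice for this sketch)
registerDoMC(2)

# Ask foreach how many workers the registered backend provides
workers <- getDoParWorkers()

# Stand-in training set: predict the species from the four measurements
x <- iris[, 1:4]
y <- iris$Species

# Grow 4 forests of 50 trees each and merge them with randomForest's combine()
rf <- foreach(ntree = rep(50, 4), .combine = randomForest::combine,
              .packages = "randomForest") %dopar% {
  randomForest(x = x, y = y, ntree = ntree)
}

# The merged forest should report the total number of trees
total_trees <- rf$ntree
print(workers)
print(total_trees)
```

One thing to keep in mind: according to the randomForest documentation, a forest produced by combine no longer carries the err.rate and confusion components, so out-of-bag error estimates have to be obtained some other way.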

That was easy enough, right? And we can now compute the number of trees needed to explore our data set thoroughly while keeping the computing time reasonable!

Merging the results of foreach runs is not strictly necessary, since foreach returns a list of results by default, but it can be done with many common functions such as c (concatenate), cbind/rbind, or even your own user-defined function!
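Here is a short sketch of those options side by side, run sequentially with %do% so it works without a backend. The helper keep_max is a made-up example of a user-defined combine function, not something provided by foreach.

```r
library(foreach)

# No .combine: foreach returns a plain list with one element per iteration
as_list <- foreach(i = 1:3) %do% {
  sqrt(i)
}

# .combine = c: results are concatenated into a vector
as_vector <- foreach(i = 1:3, .combine = c) %do% {
  sqrt(i)
}

# .combine = rbind: each result becomes one row of a matrix
as_matrix <- foreach(i = 1:3, .combine = rbind) %do% {
  c(i, i^2)
}

# User-defined combine (hypothetical helper): keep the larger of two results
keep_max <- function(a, b) if (b > a) b else a
largest <- foreach(i = c(4, 9, 2), .combine = keep_max) %do% {
  i
}
```

A custom combine function is called with two results at a time by default, so it should accept two arguments and return something it can itself be combined with.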

As an added bonus, the doMC library can easily be swapped out for other do* implementations, depending on the type of job distribution you would like to use. Some of the alternatives, such as doMPI or doSNOW, will distribute your jobs across a computing cluster with minimal alteration to the code snippets above.
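As a rough illustration of how little changes, here is a sketch using doSNOW with a cluster of two local socket workers. The cluster size and type are assumptions made for this example; a real cluster would be described by its host names instead.

```r
library(foreach)
library(doSNOW)  # also attaches the snow package

# Start 2 local worker processes (assumption: a socket cluster on this machine)
cl <- makeCluster(2, type = "SOCK")

# Register the cluster as the backend; this replaces registerDoMC()
registerDoSNOW(cl)

# The foreach call itself is unchanged
doubled <- foreach(i = 1:4, .combine = c) %dopar% {
  i * 2
}

# Always shut the workers down when you are done
stopCluster(cl)

print(doubled)
```

Unlike the forked workers of doMC, snow workers are fresh R processes, which is where the .packages option of foreach really matters.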

For further reading, you might be interested in learning how to nest foreach loops with the %:% operator and how to use the when function to prevent some evaluations from taking place, much like the if clause in Python list comprehensions.
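A minimal sketch of both ideas, again using %do% so it runs without a backend. The variable names are illustrative only.

```r
library(foreach)

# Filtering with when(): comparable to [i for i in range(1, 11) if i % 2 == 0]
evens <- foreach(i = 1:10, .combine = c) %:% when(i %% 2 == 0) %do% {
  i
}

# Nesting with %:%: the inner loop builds one row per value of i,
# and the outer .combine = rbind stacks those rows into a matrix
products <- foreach(i = 1:2, .combine = rbind) %:%
  foreach(j = 1:3, .combine = c) %do% {
    i * j
  }

print(evens)
print(products)
```

With %dopar% in place of %do%, the nested form lets foreach treat the two loops as a single stream of tasks, so you do not have to decide which loop to parallelize.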

So go ahead, save some real-world time by distributing your workload and explore your data more thoroughly!