Usually, there is more than one way to accomplish a task. Some are better, some are worse and others are just as good. Assessing which one to use is often related to the computing time, the ease of use and/or to personal preferences and abilities.

Say I have a matrix of thousands of chromosomal features with the following column names : Feature, Start, End. All the positions are found on the same chromosome and the widths of my features are variable. For a list of given SNVs positions, I would like to know which one are overlapping my features.

I could construct two bed/gtf files from those and use the intersectBed of the bedtools. However, since I’m a big fan of R, I’ll stick to three ways to do this in R.

1. The nested for loops

This is the beginner solution. Going through all my SNVs (first for loop) and for each SNV, go through all the features to see if one feature includes the SNV. If at least one of them does, return true. There’s nothing wrong with this idea. Except that nested loops are not your friends in R. It will take forever to run if you have two big datasets.

This solution is easy, but you’ll quickly become impatient if you need your results fast.


for (i in nrow(snvs)) {
bool <- c()
for (j in nrow(features)){
b <- snvs[i, "position"] > features[j, "Start"] & snvs[i, "position"] < features[j, "Stop"]
bool <- c(bool, b)
}
snvs[i,'found'] <- any(bool)
}


2. The apply of apply syntax

In R, you can always replace a for loop using the function apply which is much faster. You could go over your SNVs and features with nested apply.

On Linux or Mac, you can use the package parallel and run apply on multiple cores. This is much faster but far less easy to write.



compare <- function(p, x) {return (x > as.numeric(p['start']) & x < as.numeric(p['end']))}
search <- function(x, features){
b <- apply(features, 1, compare, x=x)
return (any(b))
}
rres <- cbind (as.matrix(snvs), found=apply(snvs, 1, search, features=features))



3. Bioconductor IRanges package

The third option I want to present takes advantages of the R developer community. Let the others work for you! The BioConductor IRanges packages implement functions that handle genomic ranges (or ranges). It's quite easy to understand and to use. And it's also quite fast. Moral of this story: Before writing everything from scratch, take the time to google it and see if there's a package out there that could help you.



ir = IRanges (start = starts, end=stops)
sr = IRanges (start = snvs, end = snvs)
findOverlaps(sr, ir)


By | 2017-05-01T10:25:02+00:00 January 15, 2015|Categories: Bioinformatics, R|0 Comments