Beginner R: functions that make your life easier

Let’s get to know my top 10 neat little R functions and tricks that make life easier when manipulating data.

Sequences

Want to make long sequences of numbers or letters but don’t feel like writing them all out into a vector?
R lets you make a sequence with “:” for numbers. You can also use seq() if you are looking for a regular sequence that is not incremented by one.

letters[] lets you make continuous letter sequences, in order or in reverse, starting from wherever you want in the alphabet. For uppercase letters, you just need to call LETTERS[] instead of letters ☺

> 1:5
[1] 1 2 3 4 5

> seq(1, 10, 2)
[1] 1 3 5 7 9

> letters[1:5]
[1] "a" "b" "c" "d" "e"

> LETTERS[1:5]
[1] "A" "B" "C" "D" "E"
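
The reverse and "start where you want" cases mentioned above are not shown in the examples, so here is a small sketch of them. The variable names (countdown, quarters, backwards, tail_end) are made up for illustration only.

```r
# Put the larger number first to count down
countdown <- 10:6
countdown
# [1] 10  9  8  7  6

# seq() with a step ("by") that is not 1
quarters <- seq(0, 1, by = 0.25)
quarters
# [1] 0.00 0.25 0.50 0.75 1.00

# Letters in reverse: index with a decreasing sequence
backwards <- letters[5:1]
backwards
# [1] "e" "d" "c" "b" "a"

# Start anywhere in the alphabet: the last three uppercase letters
tail_end <- LETTERS[24:26]
tail_end
# [1] "X" "Y" "Z"
```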

Paste

This function lets you concatenate strings and variables.
One cool thing about paste() is that it can be used inside other functions and loops.
For example, say you want to write a loop that outputs a differently named file on each turn (file1, file2, file3). Here is how you go about it:

for (i in 1:10){
    # write the i-th element of x to file1.txt, file2.txt, ...
    write.table(x[i], paste("file", i, ".txt", sep=""))
}

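What is important to know here is that paste() puts a space between the pieces by default, so you need to set the separator yourself when you want something else (sep="" for nothing at all).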

> for (i in 1:3) {
   print(paste("file", i, ".txt", sep=""))
}
[1] "file1.txt"
[1] "file2.txt"
[1] "file3.txt"

> for (i in 1:3) {
   print(paste("file", i, ".txt", sep="_"))
}
[1] "file_1_.txt"
[1] "file_2_.txt"
[1] "file_3_.txt"
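
You often do not even need the loop to build the names, because paste() works on whole vectors at once. The sketch below also uses paste0(), which behaves like paste() with sep="", and the collapse argument, neither of which appears in the examples above. The variable names are made up for illustration.

```r
# paste0() glues the pieces together with no separator,
# and it is vectorized: one call builds all three names
file_names <- paste0("file", 1:3, ".txt")
file_names
# [1] "file1.txt" "file2.txt" "file3.txt"

# collapse joins the elements of ONE vector into a single string
joined <- paste(c("a", "b", "c"), collapse = "-")
joined
# [1] "a-b-c"
```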

Working directories

Want R to write output files to, or read input files from, a specific directory?
Look at you, being all organized!
getwd(), for “get working directory”, is a function that prints the current working directory. Its brother, setwd(), sets the working directory to the path you specify.
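
Here is a minimal sketch of the two working together. It uses tempdir() only as a stand-in for your own project folder (that choice is mine, so the example runs on any machine), and old_dir is just an illustrative name.

```r
old_dir <- getwd()   # remember where we started
old_dir

setwd(tempdir())     # tempdir() stands in for a folder of your own
getwd()              # now reports the temporary directory

setwd(old_dir)       # go back to where we were
```

Saving the old directory first is a handy habit: it lets a script put things back the way it found them.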

Substring and stringsplit

Ever had a situation when you want only part of a string? Two useful functions can help you deal with this.
When you have a known mark where you would like to truncate your string, you use strsplit().
When you want everything to be a specific length, you can use substr().
For example:
Gene names that go like this “CPB1_1360” need to be split using the “_” character.

strsplit("CPB1_1360", split="_")

strsplit() returns a list with one element per input string; each element is a vector of the pieces. You must select the piece you want:

> strsplit("CPB1_1360", split="_")[[1]][1]
[1] "CPB1"

> strsplit("CPB1_1360", split="_")[[1]][2]
[1] "1360"

If your separator is a special character, that is, a character with a special meaning in a regular expression (like . which means “any character” or | which means “or”), you need to escape it with “\\” or use the option fixed=TRUE.

strsplit("CPB1|1360", split="\\|")
strsplit("CPB1|1360", split="|", fixed=TRUE)

Another example:
When you have patient barcodes that look like “TCGA.E2.A154.01A.11R.A115.07” and you want to shorten them to their first 12 characters (the TCGA-XX-XXXX part), use substr() like this. Note that R counts characters starting from 1:

> substr("TCGA.E2.A154.01A.11R.A115.07", start=1, stop=12)
[1] "TCGA.E2.A154"
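
In practice you usually have a whole vector of names rather than a single string. Here is a sketch of one way to handle that, assuming every name contains the separator. The second and third gene names are invented for the example, and the gsub() call that swaps dots for dashes is an extra step that is not part of the examples above.

```r
# A vector of gene names (only the first one comes from the text above)
genes <- c("CPB1_1360", "CPB1_2045", "CPB2_0007")

# strsplit() gives back a list with one vector of pieces per gene
parts <- strsplit(genes, split = "_", fixed = TRUE)

# Pull the first and second piece out of every list element
prefixes <- sapply(parts, `[`, 1)
suffixes <- sapply(parts, `[`, 2)
prefixes
# [1] "CPB1" "CPB1" "CPB2"
suffixes
# [1] "1360" "2045" "0007"

# substr() is vectorized too; then swap "." for "-" to get TCGA-XX-XXXX
barcode <- "TCGA.E2.A154.01A.11R.A115.07"
short_barcode <- gsub(".", "-", substr(barcode, start = 1, stop = 12), fixed = TRUE)
short_barcode
# [1] "TCGA-E2-A154"
```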

Membership of lists

Membership can be tested. When you have two lists of values (vectors in R), a and b, and want to know which elements of a are in b, the syntax is a %in% b. This returns a vector of boolean values (TRUE/FALSE) of the same length as a, telling you whether each value of a is in b. Careful, the reverse question needs to be written b %in% a. Here is an example:

> a = c("A","B","C")
> b = c("C","D","R")
> a%in%b
[1] FALSE FALSE TRUE

> b%in%a
[1] TRUE FALSE FALSE
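
The TRUE/FALSE vector becomes really useful when you use it to pick out elements. A short sketch with the same a and b as above; the names shared, only_a and positions are made up for illustration.

```r
a <- c("A", "B", "C")
b <- c("C", "D", "R")

# Keep the elements of a that are also in b
shared <- a[a %in% b]
shared
# [1] "C"

# Negate with ! to keep the elements of a that are NOT in b
only_a <- a[!(a %in% b)]
only_a
# [1] "A" "B"

# which() turns the TRUE/FALSE vector into positions
positions <- which(a %in% b)
positions
# [1] 3
```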

Selection

Want to select a specific column in a dataframe but don’t feel like counting them to know which index to use? That’s what column names and “$” are for!

> a
  bcr_patient_barcode er_status_by_ihc pr_status_by_ihc her2_status_by_ihc
1        TCGA-AR-A1AR         Negative         Negative           Negative
2        TCGA-BH-A1EO         Positive         Positive           Negative
3        TCGA-BH-A1ES         Positive         Positive           Negative
4        TCGA-BH-A1ET         Positive         Positive           Negative

> a$er_status_by_ihc
[1] "Negative" "Positive" "Positive" "Positive"
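
If you want to try this yourself, here is a minimal sketch that rebuilds the small table above by hand (in real life it would probably come from a file). The [[ ]] form and the filtering step at the end are additions of mine, and er_positive is a made-up name.

```r
# Rebuild the example data frame by typing the values in
a <- data.frame(
  bcr_patient_barcode = c("TCGA-AR-A1AR", "TCGA-BH-A1EO", "TCGA-BH-A1ES", "TCGA-BH-A1ET"),
  er_status_by_ihc    = c("Negative", "Positive", "Positive", "Positive"),
  pr_status_by_ihc    = c("Negative", "Positive", "Positive", "Positive"),
  her2_status_by_ihc  = c("Negative", "Negative", "Negative", "Negative"),
  stringsAsFactors = FALSE
)

# names() lists the column names, so you never have to count columns
names(a)

# $ and [[ ]] both select a column by name
a$er_status_by_ihc
a[["er_status_by_ihc"]]

# Combine $ with a comparison to pick the barcodes of ER-positive patients
er_positive <- a$bcr_patient_barcode[a$er_status_by_ihc == "Positive"]
er_positive
# [1] "TCGA-BH-A1EO" "TCGA-BH-A1ES" "TCGA-BH-A1ET"
```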

Another example is selecting components of an object.
For example, to output only the p-value of a t-test, add $p.value to the result of t.test().

> t.test(e1[,1],e2[,2])
Welch Two Sample t-test
data: e1[, 1] and e2[, 2]
t = 0.0581, df = 1645.314, p-value = 0.9537
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02670277 0.02833160
sample estimates:
mean of x mean of y
0.5019641 0.5011497

> t.test(e1[,1],e2[,2])$p.value
[1] 0.9537151

To check what the components are called, use names(object) like this:

> names(t.test(e1[,1],e2[,2]))
[1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "alternative"
[8] "method" "data.name"
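
The e1 and e2 objects above are not defined in this post, so here is a self-contained sketch that uses simulated data instead. The vectors group1 and group2 and the seed are assumptions made purely for illustration. Storing the result in a variable (res) saves you from rerunning the test every time you want another component.

```r
set.seed(1)                                  # reproducible fake data
group1 <- rnorm(50, mean = 0.5, sd = 0.2)    # hypothetical measurements
group2 <- rnorm(50, mean = 0.5, sd = 0.2)

# Run the test once and keep the whole result
res <- t.test(group1, group2)

# See which components are available, then pick the ones you need
names(res)
p <- res$p.value      # a single number
ci <- res$conf.int    # lower and upper bound of the confidence interval
p
ci
```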

Distributions and random sampling

To get a series of random numbers drawn from a normal distribution, call rnorm(), specifying how many numbers you want, the mean and the standard deviation.
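
Here is a small sketch of what that looks like. The set.seed() call and the sample() example are additions of mine: the first makes the "random" numbers reproducible, the second draws values from a vector you supply. I have left out the printed numbers since they depend on the seed and your R version.

```r
set.seed(42)   # optional: same seed, same random numbers

# Five values from a normal distribution with mean 10 and sd 2
x <- rnorm(5, mean = 10, sd = 2)
x

# With many draws, the sample mean and sd land close to what you asked for
y <- rnorm(1000, mean = 10, sd = 2)
mean(y)
sd(y)

# sample() picks values at random from a vector, here 3 of the numbers 1 to 10
picked <- sample(1:10, 3)
picked
```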

Help

If you are lost and not sure of what a function does, typing ? in front of the function name opens up the help manual for that function, giving you the details for it.
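
A quick sketch of what that looks like, using rnorm() as the example. The help() and args() calls are additions of mine that do closely related jobs.

```r
?rnorm           # opens the help page for rnorm()
help("rnorm")    # the same thing, written as a regular function call

# Operators need quotes, otherwise R gets confused
?"%in%"

# args() is a quick reminder of a function's arguments and their defaults
args(rnorm)
```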

Upper case, lower case

If you want to convert all the letters in a string to upper or lower case, toupper() and tolower() are very handy.

> toupper("RealLy UglY fOrmaT")
[1] "REALLY UGLY FORMAT"

> tolower("MAke Me sMALLeR")
[1] "make me smaller"
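
One place where this pays off is when comparing values that were typed inconsistently. A small sketch, with made-up status values:

```r
# Hypothetical messy values, as you might get from hand-typed data
status <- c("Positive", "positive", "POSITIVE", "Negative")

# A plain comparison is case sensitive and misses two of them
strict <- status == "positive"
strict
# [1] FALSE  TRUE FALSE FALSE

# Lower-casing first makes the comparison ignore case
relaxed <- tolower(status) == "positive"
relaxed
# [1]  TRUE  TRUE  TRUE FALSE
```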

Repetitions and patterns

rep() lets you make custom repetitions. You can specify a pattern and then either specify the number of repeats or specify the total output length with length.out.
Let’s take the most repetitive pop song as an example (repeat 4 times with a high-pitched voice)

> rep(c("I'm like a bird, I'll only fly away", "I don't know where my soul is, I don't know where my home is"),times=4)
[1] "I'm like a bird, I'll only fly away"
[2] "I don't know where my soul is, I don't know where my home is"
[3] "I'm like a bird, I'll only fly away"
[4] "I don't know where my soul is, I don't know where my home is"
[5] "I'm like a bird, I'll only fly away"
[6] "I don't know where my soul is, I don't know where my home is"
[7] "I'm like a bird, I'll only fly away"
[8] "I don't know where my soul is, I don't know where my home is"
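
The length.out option mentioned above is not shown in the song example, so here is a shorter sketch of the different ways to control the repetition. The each argument is an extra that the text does not cover, and the variable names are made up.

```r
# times: repeat the whole pattern
whole <- rep(c("a", "b"), times = 3)
whole
# [1] "a" "b" "a" "b" "a" "b"

# each: repeat every element before moving on to the next
one_by_one <- rep(c("a", "b"), each = 3)
one_by_one
# [1] "a" "a" "a" "b" "b" "b"

# length.out: keep repeating the pattern until the output has 7 elements
fixed_len <- rep(c("a", "b", "c"), length.out = 7)
fixed_len
# [1] "a" "b" "c" "a" "b" "c" "a"
```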