Let’s get to know my top 10 neat little R functions and tricks that make life easier when manipulating data.
Sequences
Want to make long sequences of numbers or letters but don’t feel like writing them all out into a vector?
R lets you make a sequence with “:” for numbers. You can also use seq() if you are looking for a regular sequence that is not incremented by one. For letters, index letters[] in the same way. For capital letters, use LETTERS[] instead of letters ☺
> 1:5
[1] 1 2 3 4 5
> seq(1, 10, 2)
[1] 1 3 5 7 9
> letters[1:5]
[1] "a" "b" "c" "d" "e"
> LETTERS[1:5]
[1] "A" "B" "C" "D" "E"
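As a small sketch beyond the calls above, seq() also takes named arguments: by sets the step and length.out sets how many values you get. The numbers and variable names here are arbitrary choices for illustration.

```r
# Step of 0.5 between 0 and 2
half_steps <- seq(0, 2, by = 0.5)
print(half_steps)

# Five evenly spaced values between 0 and 1
five_points <- seq(0, 1, length.out = 5)
print(five_points)

# A descending sequence made with ":" used to index letters
countdown <- letters[5:1]
print(countdown)
```

Naming the arguments makes it obvious whether the third number is a step or a length, which the positional form seq(1, 10, 2) leaves to memory.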
Paste
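This function lets you concatenate strings and variables. What is important to know here is that paste() puts a single space between the pieces by default, so you need to specify the separator with sep if you want anything else.
One cool thing about paste() is that it can be put inside other functions and loops.
For example, you want to make a loop, and for each turn of the loop, you want to output a file, named differently (file1, file2, file3). Here is how you go about it:
for (i in 1:10) {
write.table(x[i], paste("file", i, ".txt", sep=""))
}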
> for (i in 1:3) {
print(paste("file", i, ".txt", sep=""))
}
[1] "file1.txt"
[1] "file2.txt"
[1] "file3.txt"
> for (i in 1:3) {
print(paste("file", i, ".txt", sep="_"))
}
[1] "file_1_.txt"
[1] "file_2_.txt"
[1] "file_3_.txt"
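Two related shortcuts are worth a quick sketch: paste0() is paste() with sep="" built in, and the collapse argument glues a whole vector into one string. The file names below are made up for illustration.

```r
# paste0() is equivalent to paste(..., sep = "")
file_names <- paste0("file", 1:3, ".txt")
print(file_names)

# collapse joins the elements of a vector into a single string
one_line <- paste(file_names, collapse = ", ")
print(one_line)
```

Because paste() is vectorised over 1:3, no loop is needed just to build the names.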
Working directories
Want R to put output files in, or get files from, a specific directory? Look at you, being all organized!
getwd(), for “get working directory”, is a function that prints out the current working directory. Its brother, setwd(), sets the directory to something you specify.
Substring and stringsplit
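Ever had a situation when you want only part of a string? Two useful functions can help you deal with this.
When you have a known mark where you would like to cut your string, you use strsplit().
When you want everything to be a specific length, you can use substr().
For example:
Gene names that go like this “CPB1_1360” need to be split using the “_” character.
strsplit("CPB1_1360", split="_")
String split creates a list with one element per input string, each holding the pieces. You must select the one you want: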
> strsplit("CPB1_1360", split="_")[[1]][1]
[1] "CPB1"
> strsplit("CPB1_1360", split="_")[[1]][2]
[1] "1360"
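Because strsplit() works on a whole vector of strings at once, you get one list element per input string. Here is a minimal sketch of pulling the first piece out of each one with sapply(). The gene names other than CPB1_1360 are invented for the example.

```r
# Hypothetical gene names; only the first one comes from the text above
genes <- c("CPB1_1360", "ABC2_0042", "XYZ9_7")

# One list element per input string, each holding the split pieces
pieces <- strsplit(genes, split = "_")

# Keep only the first piece of every element
prefixes <- sapply(pieces, function(p) p[1])
print(prefixes)
```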
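If your separator is a special character, that is a character having a special meaning in a regular expression (like . which means “any character” or | which means “or”), you need to escape it with “\\” or use the option fixed=TRUE.
strsplit("CPB1|1360", split="\\|")
strsplit("CPB1|1360", split="|", fixed=TRUE)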
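The same two options apply when the separator is a dot. This is a small sketch on a shortened barcode-style string (chosen for illustration) showing that both routes give the same pieces.

```r
barcode <- "TCGA.E2.A154"

# Escaped: "\\." matches a literal dot
escaped <- strsplit(barcode, split = "\\.")[[1]]

# fixed = TRUE: the separator is taken literally, no regular expression involved
literal <- strsplit(barcode, split = ".", fixed = TRUE)[[1]]

print(escaped)
print(identical(escaped, literal))
```

fixed=TRUE is usually the easier one to read when the separator is a plain character and no pattern matching is needed.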
Other example:
Patient barcodes that go like this “TCGA.E2.A154.01A.11R.A115.07” and that you want to shorten to, say, TCGA-XX-XXXX (the first 12 characters): you use substr() like this:
> substr("TCGA.E2.A154.01A.11R.A115.07", start=1, stop=12)
[1] "TCGA.E2.A154"
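substr() also accepts a whole vector, so a column of barcodes can be shortened in one call. A minimal sketch follows; the second barcode is invented for illustration.

```r
# The second barcode is a made-up example in the same format
barcodes <- c("TCGA.E2.A154.01A.11R.A115.07",
              "TCGA.AR.A1AR.01A.31R.A137.07")

# substr() is vectorised: every barcode is cut to its first 12 characters
short_ids <- substr(barcodes, start = 1, stop = 12)
print(short_ids)

# nchar() confirms that all results have the same length
print(nchar(short_ids))
```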
Membership of lists
Membership of lists can be tested. When you have two lists, a and b, and want to know which elements of a are in b, the syntax is a %in% b. This returns a vector of boolean values (TRUE/FALSE) of the same length as a that specifies whether each value of a is in b. Careful, the reverse needs to be written b %in% a. Here is an example:
> a = c("A","B","C")
> b = c("C","D","R")
> a%in%b
[1] FALSE FALSE TRUE
> b%in%a
[1] TRUE FALSE FALSE
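The TRUE/FALSE vector becomes really useful once you use it as an index. A short sketch with the same a and b as above (the names shared, only_in_a and positions are my own):

```r
a <- c("A", "B", "C")
b <- c("C", "D", "R")

# Use the logical vector as an index to keep only the shared elements
shared <- a[a %in% b]
print(shared)

# Negate it to keep the elements of a that are absent from b
only_in_a <- a[!(a %in% b)]
print(only_in_a)

# which() gives the positions instead of TRUE/FALSE
positions <- which(a %in% b)
print(positions)
```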
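Selection
Want to select a specific column in a dataframe but don’t feel like counting them to know which index to use? That’s what column names and “$” are for!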
> a
  bcr_patient_barcode er_status_by_ihc pr_status_by_ihc her2_status_by_ihc
1        TCGA-AR-A1AR         Negative         Negative           Negative
2        TCGA-BH-A1EO         Positive         Positive           Negative
3        TCGA-BH-A1ES         Positive         Positive           Negative
4        TCGA-BH-A1ET         Positive         Positive           Negative
> a$er_status_by_ihc
[1] "Negative" "Positive" "Positive" "Positive"
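“$” is not the only way to select by name. Here is a minimal sketch on a cut-down copy of the table above (the data frame name clinical is my own choice) showing three spellings that return the same column.

```r
# A two-row, two-column stand-in for the table shown above
clinical <- data.frame(
  bcr_patient_barcode = c("TCGA-AR-A1AR", "TCGA-BH-A1EO"),
  er_status_by_ihc    = c("Negative", "Positive"),
  stringsAsFactors    = FALSE
)

# Three equivalent ways to pull out one column by name
by_dollar   <- clinical$er_status_by_ihc
by_brackets <- clinical[["er_status_by_ihc"]]
by_index    <- clinical[, "er_status_by_ihc"]

print(identical(by_dollar, by_brackets) && identical(by_dollar, by_index))

# colnames() lists the available column names
print(colnames(clinical))
```

The bracket forms are handy when the column name is stored in a variable, which “$” cannot use.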
Another example is selecting features in an object. For example, you want to output the p-value of a t.test: add $p.value to the result of the call.
> t.test(e1[,1],e2[,2])
Welch Two Sample t-test
data: e1[, 1] and e2[, 2]
t = 0.0581, df = 1645.314, p-value = 0.9537
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.02670277 0.02833160
sample estimates:
mean of x mean of y
0.5019641 0.5011497
> t.test(e1[,1],e2[,2])$p.value
[1] 0.9537151
To check how the features are called, use names(object) like this:
> names(t.test(e1[,1],e2[,2]))
[1] "statistic" "parameter" "p.value" "conf.int" "estimate" "null.value" "alternative"
[8] "method" "data.name"
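If you need several features, it is tidier to store the test result once and pick from it. This sketch uses simulated data, since e1 and e2 are not defined in this post; the seed, sample sizes and parameters are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the simulated numbers are reproducible

# Two simulated samples standing in for e1[,1] and e2[,2]
x <- rnorm(50, mean = 0.5, sd = 0.1)
y <- rnorm(50, mean = 0.5, sd = 0.1)

# Store the result once, then pick out the pieces by name
result <- t.test(x, y)
print(names(result))
print(result$p.value)
print(result$conf.int)
```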
Distributions and random sampling
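To get a series of random numbers coming from a normal distribution, call rnorm(), specifying the number of values, the mean and the standard deviation.
Help
If you are lost and not sure of what a function does, typing ? in front of the function name opens up the help manual for that function, giving you the details for it.
Upper case, lower case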
If you want to put all letters in a string to upper or lower case, toupper() and tolower() are very handy.
> toupper("RealLy UglY fOrmaT")
[1] "REALLY UGLY FORMAT"
> tolower("MAke Me sMALLeR")
[1] "make me smaller"
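A typical use is cleaning up labels that were typed inconsistently before comparing them. A small sketch, with made-up status values:

```r
# Hypothetical labels with inconsistent capitalisation
samples <- c("Positive", "POSITIVE", "positive", "Negative")

# Normalise the case first so that all spellings compare equal
is_positive <- tolower(samples) == "positive"
print(is_positive)

# table() then counts each status regardless of the original formatting
status_counts <- table(toupper(samples))
print(status_counts)
```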
Repetitions and patterns
rep() lets you make custom repetitions. You can specify a pattern and then either specify the number of repeats with times or specify the total output length with length.out.
Let’s take the most repetitive pop song as an example (repeat 4 times with a high-pitched voice):
> rep(c("I'm like a bird, I'll only fly away", "I don't know where my soul is, I don't know where my home is"), times=4)
[1] "I'm like a bird, I'll only fly away"
[2] "I don't know where my soul is, I don't know where my home is"
[3] "I'm like a bird, I'll only fly away"
[4] "I don't know where my soul is, I don't know where my home is"
[5] "I'm like a bird, I'll only fly away"
[6] "I don't know where my soul is, I don't know where my home is"
[7] "I'm like a bird, I'll only fly away"
[8] "I don't know where my soul is, I don't know where my home is"
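For a shorter look at the options, here is a sketch comparing times, each and length.out on a two-letter pattern (chosen only to keep the output small):

```r
pattern <- c("a", "b")

# times repeats the whole pattern
by_times <- rep(pattern, times = 3)
print(by_times)

# each repeats every element before moving to the next one
by_each <- rep(pattern, each = 3)
print(by_each)

# length.out cuts the repetition at a total length
by_length <- rep(pattern, length.out = 5)
print(by_length)
```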