As bioinformaticians, we often have to work with data that are not formatted the way we would need them to be. One case we might encounter is receiving data in a “long” format instead of receiving them in a more familiar “wide” format. For those of you familiar with the ggplot R package, you know this format very well. It’s the format required by ggplot to produce its nice graphs.
genes samples expression 1 BAD S01 7.525395 2 BAD S02 9.020534 3 BAD S03 14.099557 4 TP53 S01 5.175252 5 TP53 S02 4.437941 6 TP53 S03 3.827847 7 A2M S01 1.055729 8 A2M S02 17.104853 9 A2M S03 14.456857
genes S01 S02 S03 1 A2M 1.055730 17.104854 14.456857 2 BAD 7.525395 9.020534 14.099557 3 TP53 5.175253 4.437941 3.827848
Table1. Small dataset arranged in long (left) and wide (right) formats.
Starting from the long format, how can we get a wide matrix (where the genes are found as rows and the samples are found as columns) ?
There are several ways to do the conversion, but first let’s generate some data so you can try pivoting tables on your own.
Generating data
We can generate a small toy dataset containing 3 genes and 3 samples in R using the rep
and rnorm
functions.
# repeat each item in the vector n times
genes = rep(c('BAD', 'TP53', 'A2M'),each=3)
#[1] "BAD" "BAD" "BAD" "TP53" "TP53" "TP53" "A2M" "A2M" "A2M"
# repeat the vector n times
samples = rep(c('S01', 'S02', 'S03'),times=3)
#[1] "S01" "S02" "S03" "S01" "S02" "S03" "S01" "S02" "S03"
# choose numbers from normal distribution randomly
expression = abs(rnorm(9, 1, 10))
#[1] 17.623573 13.709192 4.529254 3.056479 2.407857 15.619984 2.309779
#[8] 8.497227 14.941006
df = as.data.frame(cbind(genes, samples, expression))
df$expression = as.numeric(as.character(df$expression))
This will generate the long format matrix presented in Table1. We can use the same idea to generate a big dataset of 10,000 genes and 1,000 samples*.
ngenes = 10000
nsamples = 1000
genes = rep(paste('G',seq(ngenes), sep=''),each=nsamples)
samples = rep(paste('S',seq(nsamples), sep=''),times=ngenes)
expression = abs(rnorm(ngenes*nsamples, 1, 10))
df = as.data.frame(cbind(genes, samples, expression))
df$expression = as.numeric(as.character(df$expression))
write.table(df, file='longbig.txt', sep='\t', quote=F)
Pivoting in R
In R, the most used packages to pivot tables are : tidyr and reshape2. The functions dcast
from reshape2 and spread
from tidyr are quite easy to use. We just need to specify the data to pivot, how to organize the rows and columns and what values should be used to fill the matrix.
The dcast
function expects a formula stating which variables should be used as row identifiers and which variable to put in columns. The var.value
parameters indicates which value should fill the matrix. If there is more than one variable to identify each row, the formula will look like : varRow1 + varRow2 ~ varCol
. For example, genes + organism ~ samples
would put each sample as a column and have the gene and organism as row identifier.
With spread
the only variables to specify are the ones to distribute as columns and the one to use to fill the matrix.
library(reshape2)
tb = dcast(df, genes ~ samples, value.var="expression")
library(tidyr)
tb = spread(df, samples, expression)
For the big dataset of 10,000 genes and 1,000 samples, the long format matrix contains 10,000,000 rows and 3 columns. It takes about 60-90 seconds to read this 331 MB file in R and 10 seconds to pivot the table.
Pivoting in Python
It is just as easy to go from long to wide in python using the pandas package. This library has a lot of useful functions for data analysis. Actually, since the addition nice R-like visualization functions, I’ve found myself using it more and more instead of using R.
Here is how I would pivot with pandas in python :
import pandas
df = pandas.read_csv('longbig.txt', sep='\t')
tb = pandas.pivot_table(df, values="expression", index=["genes"], columns="samples")
Reading the file is much faster but the pivoting step takes longer. In the end, pivoting the big dataset takes pretty much the same time as it took in R.
Aggregation
Python’s pivot_table
and R’s dcast
functions can also be used to aggregate data i.e. to reduce data dimensions using functions such as sum or mean. By default, dcast
will use the length (or count) and pivot_table
will use the mean as aggregation function, but this is up to you. If there is more than one row with the same identifier, the aggregation function will be applied automatically. For example, in the following code, ignoring the ‘samples’ in the pivoting step (ie: asking for only one column for ‘expression’) results in having more than one expression value per gene. Hence, a mean is computed.
tab = pandas.pivot_table(df, values="expression", index=["genes"])
## Mean of gene in samples
#genes
#G1 7.959172
#G10 8.017683
#G100 8.199654
#G1000 8.059022
#...
df.ix[df.genes=='G1',:].mean() # Checking that the output is the mean
#expression 7.959172
#dtype: float64
To use aggregation on non-numeric values, don’t forget to define your own aggregation function !
* If this step takes to much time in R, switch to python (pandas is required here):
import random
import pandas
ngenes = 10000
nsamples = 1000
genes = ['G%i' %i for i in range(ngenes)]*nsamples
samples = ['S%i' %i for i in range(nsamples)]*ngenes
samples.sort()
expression = [abs(random.gauss(1,10)) for i in range(nsamples*ngenes)]
df = pandas.DataFrame(zip(genes, samples), columns=['genes', 'samples'])
df['expression'] = expression
df.to_csv('longbig.txt', sep='\t')
Leave A Comment