As bioinformaticians, we often have to work with data that are not formatted the way we need them to be. One case we might encounter is receiving data in a “long” format instead of the more familiar “wide” format. Those of you who use the ggplot2 R package know the long format very well: it is the format ggplot2 requires to produce its nice graphs.
Long format:

   genes samples expression
1    A2M     S01   1.055730
2    A2M     S02  17.104854
3    A2M     S03  14.456857
4    BAD     S01   7.525395
5    BAD     S02   9.020534
6    BAD     S03  14.099557
7   TP53     S01   5.175253
8   TP53     S02   4.437941
9   TP53     S03   3.827848

Wide format:

  genes       S01       S02       S03
1   A2M  1.055730 17.104854 14.456857
2   BAD  7.525395  9.020534 14.099557
3  TP53  5.175253  4.437941  3.827848

Table 1. Small dataset arranged in long (top) and wide (bottom) formats.
Starting from the long format, how can we get a wide matrix, where the genes are rows and the samples are columns?
There are several ways to do the conversion, but first let’s generate some data so you can try pivoting tables on your own.
We can generate a small toy dataset containing 3 genes and 3 samples in R using the rep function:
# repeat each item in the vector n times
genes <- rep(c('BAD', 'TP53', 'A2M'), each=3)
# "BAD" "BAD" "BAD" "TP53" "TP53" "TP53" "A2M" "A2M" "A2M"

# repeat the whole vector n times
samples <- rep(c('S01', 'S02', 'S03'), times=3)
# "S01" "S02" "S03" "S01" "S02" "S03" "S01" "S02" "S03"

# draw random numbers from a normal distribution
expression <- abs(rnorm(9, 1, 10))
# 17.623573 13.709192  4.529254  3.056479  2.407857 15.619984  2.309779
#  8.497227 14.941006

df <- as.data.frame(cbind(genes, samples, expression))
# cbind coerces everything to character, so convert expression back to numeric
df$expression <- as.numeric(as.character(df$expression))
This will generate the long format matrix presented in Table 1. We can use the same idea to generate a big dataset of 10,000 genes and 1,000 samples*.
ngenes <- 10000
nsamples <- 1000
genes <- rep(paste('G', seq(ngenes), sep=''), each=nsamples)
samples <- rep(paste('S', seq(nsamples), sep=''), times=ngenes)
expression <- abs(rnorm(ngenes*nsamples, 1, 10))
df <- as.data.frame(cbind(genes, samples, expression))
df$expression <- as.numeric(as.character(df$expression))
write.table(df, file='longbig.txt', sep='\t', quote=F)
Pivoting in R
In R, the most commonly used packages to pivot tables are tidyr and reshape2. The functions dcast from reshape2 and spread from tidyr are quite easy to use: we just need to specify the data to pivot, how to organize the rows and columns, and which values should fill the matrix.

The dcast function expects a formula stating which variables should be used as row identifiers and which variable to spread into columns; the value.var parameter indicates which values should fill the matrix. If more than one variable is needed to identify each row, the formula looks like varRow1 + varRow2 ~ varCol. For example, genes + organism ~ samples would put each sample in a column and use the gene and organism as row identifiers.

With spread, the only variables to specify are the one to distribute as columns and the one to use to fill the matrix.
library(reshape2)
tb <- dcast(df, genes ~ samples, value.var="expression")

library(tidyr)
tb <- spread(df, samples, expression)
For the big dataset of 10,000 genes and 1,000 samples, the long format matrix contains 10,000,000 rows and 3 columns. It takes about 60-90 seconds to read this 331 MB file in R and 10 seconds to pivot the table.
Pivoting in Python
It is just as easy to go from long to wide in Python using the pandas package. This library has a lot of useful functions for data analysis. Actually, since the addition of nice R-like visualization functions, I've found myself using it more and more instead of R.

Here is how I would pivot with pandas in Python:
import pandas

df = pandas.read_csv('longbig.txt', sep='\t')
tb = pandas.pivot_table(df, values="expression", index=["genes"], columns="samples")
Reading the file is much faster but the pivoting step takes longer. In the end, pivoting the big dataset takes pretty much the same time as it took in R.
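As an aside, the multi-variable row identifiers that dcast writes as genes + organism ~ samples are expressed in pandas by passing a list of columns to index, and pandas.melt undoes the pivot if you ever need the long format back (for plotting, say). A minimal sketch on a hand-made frame; the 'organism' column is invented for the example:

```python
import pandas

# toy long-format frame; the 'organism' column is made up for this sketch
df = pandas.DataFrame({
    'genes':      ['BAD', 'BAD', 'TP53', 'TP53'],
    'organism':   ['human', 'human', 'human', 'human'],
    'samples':    ['S01', 'S02', 'S01', 'S02'],
    'expression': [1.0, 2.0, 3.0, 4.0],
})

# pandas counterpart of dcast(df, genes + organism ~ samples, value.var="expression")
wide = pandas.pivot_table(df, values='expression',
                          index=['genes', 'organism'], columns='samples')

# melt reverses the operation: back to one row per (gene, organism, sample)
long_again = wide.reset_index().melt(id_vars=['genes', 'organism'],
                                     var_name='samples', value_name='expression')
```

Here wide gets a (genes, organism) MultiIndex on the rows and one column per sample, and long_again recovers the original three-column layout.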
pivot_table and R's dcast can also be used to aggregate data, i.e. to reduce data dimensions using functions such as sum or mean. By default, dcast uses the length (a count) and pivot_table uses the mean as aggregation function, but this is up to you (via the fun.aggregate and aggfunc arguments, respectively). If there is more than one row with the same identifier, the aggregation function is applied automatically. For example, in the following code, ignoring 'samples' in the pivoting step (i.e. asking for a single 'expression' column) leaves more than one expression value per gene, so a mean is computed.
tab = pandas.pivot_table(df, values="expression", index=["genes"])
# mean expression of each gene across all samples
#        expression
# genes
# G1       7.959172
# G10      8.017683
# G100     8.199654
# G1000    8.059022
# ...

# checking that the output is indeed the mean
df.loc[df.genes == 'G1', 'expression'].mean()
# 7.959172
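The default can be swapped for any reducing function through pivot_table's aggfunc argument (fun.aggregate plays the same role in dcast). A small sketch on a hand-made frame:

```python
import pandas

# tiny long frame with repeated gene identifiers (made up for the example)
df = pandas.DataFrame({'genes': ['BAD', 'BAD', 'TP53'],
                       'expression': [1.0, 2.0, 4.0]})

# sum the repeated rows instead of using the default mean
tb = pandas.pivot_table(df, values='expression', index='genes', aggfunc='sum')
```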
To use aggregation on non-numeric values, don't forget to define your own aggregation function!
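Strings, for instance, cannot be averaged, so you have to supply something like a join. A minimal sketch, with a made-up 'pathway' annotation column:

```python
import pandas

# toy long frame with a non-numeric column (invented for the example)
df = pandas.DataFrame({'genes': ['BAD', 'BAD', 'TP53'],
                       'pathway': ['apoptosis', 'signaling', 'apoptosis']})

# the default mean would fail on strings; concatenate them instead
tb = pandas.pivot_table(df, values='pathway', index='genes',
                        aggfunc=lambda values: ';'.join(sorted(values)))
```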
* If this step takes too much time in R, switch to Python (pandas is required here):
import random
import pandas

ngenes = 10000
nsamples = 1000
# start at 1 to match the R naming (G1..G10000, S1..S1000)
genes = ['G%i' % i for i in range(1, ngenes + 1)] * nsamples
samples = ['S%i' % i for i in range(1, nsamples + 1)] * ngenes
samples.sort()
expression = [abs(random.gauss(1, 10)) for i in range(nsamples * ngenes)]
df = pandas.DataFrame(list(zip(genes, samples)), columns=['genes', 'samples'])
df['expression'] = expression
df.to_csv('longbig.txt', sep='\t')