As bioinformaticians, we often have to work with data that are not formatted the way we need them to be. One case we might encounter is receiving data in a “long” format instead of the more familiar “wide” format. Those of you familiar with the ggplot2 R package know this format very well: it is the format ggplot2 requires to produce its nice graphs.

 

Long


  genes   samples expression
1   BAD     S01    7.525395
2   BAD     S02    9.020534
3   BAD     S03   14.099557
4  TP53     S01    5.175252
5  TP53     S02    4.437941
6  TP53     S03    3.827847
7   A2M     S01    1.055729
8   A2M     S02   17.104853
9   A2M     S03   14.456857
Wide


  genes      S01       S02       S03
1   A2M 1.055730 17.104854 14.456857
2   BAD 7.525395  9.020534 14.099557
3  TP53 5.175253  4.437941  3.827848

Table 1. Small dataset arranged in long (top) and wide (bottom) formats.

Starting from the long format, how can we get a wide matrix, where the genes are found as rows and the samples are found as columns?

There are several ways to do the conversion, but first let’s generate some data so you can try pivoting tables on your own.

Generating data

We can generate a small toy dataset containing 3 genes and 3 samples in R using the rep and rnorm functions.


# repeat each item in the vector n times
genes = rep(c('BAD', 'TP53', 'A2M'),each=3)  
#[1] "BAD"  "BAD"  "BAD"  "TP53" "TP53" "TP53" "A2M"  "A2M"  "A2M"

# repeat the vector n times
samples = rep(c('S01', 'S02', 'S03'),times=3)  
#[1] "S01" "S02" "S03" "S01" "S02" "S03" "S01" "S02" "S03"

# choose numbers from normal distribution randomly
expression = abs(rnorm(9, 1, 10))
#[1] 17.623573 13.709192  4.529254  3.056479  2.407857 15.619984  2.309779
#[8]  8.497227 14.941006

# build the data frame; data.frame() keeps expression numeric,
# whereas cbind() would coerce everything to character first
df = data.frame(genes, samples, expression)

This will generate the long-format data frame presented in Table 1. We can use the same idea to generate a big dataset of 10,000 genes and 1,000 samples*.


ngenes = 10000
nsamples = 1000
genes = rep(paste('G',seq(ngenes), sep=''),each=nsamples)
samples = rep(paste('S',seq(nsamples), sep=''),times=ngenes)
expression = abs(rnorm(ngenes*nsamples, 1, 10))
df = data.frame(genes, samples, expression)
write.table(df, file='longbig.txt', sep='\t', quote=FALSE)

Pivoting in R

In R, the most commonly used packages to pivot tables are tidyr and reshape2. The functions dcast from reshape2 and spread from tidyr are quite easy to use: we just need to specify the data to pivot, how to organize the rows and columns, and which values should fill the matrix.

The dcast function expects a formula stating which variables should be used as row identifiers and which variable should be spread into columns. The value.var parameter indicates which values should fill the matrix. If more than one variable is needed to identify each row, the formula looks like varRow1 + varRow2 ~ varCol. For example, genes + organism ~ samples would put each sample as a column and use the gene and organism as row identifiers.

With spread, we only need to specify the variable to spread into columns and the variable used to fill the matrix.


library(reshape2)
tb = dcast(df, genes ~ samples, value.var="expression")

library(tidyr)
tb = spread(df, samples, expression) 

For the big dataset of 10,000 genes and 1,000 samples, the long format matrix contains 10,000,000 rows and 3 columns. It takes about 60-90 seconds to read this 331 MB file in R and 10 seconds to pivot the table.

Pivoting in Python

It is just as easy to go from long to wide in Python using the pandas package. This library has a lot of useful functions for data analysis; in fact, since the addition of nice R-like visualization functions, I’ve found myself using it more and more instead of R.

Here is how I would pivot with pandas in Python:

import pandas
df = pandas.read_csv('longbig.txt', sep='\t')
tb = pandas.pivot_table(df, values="expression", index=["genes"], columns="samples")

Reading the file is much faster but the pivoting step takes longer. In the end, pivoting the big dataset takes pretty much the same time as it took in R.
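The dcast formula with several row identifiers mentioned earlier (genes + organism ~ samples) has a direct equivalent in pandas: pass a list of columns as the index. A small sketch, with a made-up organism column for illustration:

```python
import pandas

# toy long-format data with two row identifiers ('organism' is hypothetical)
df = pandas.DataFrame({
    'genes':      ['BAD', 'BAD', 'TP53', 'TP53'],
    'organism':   ['human', 'human', 'mouse', 'mouse'],
    'samples':    ['S01', 'S02', 'S01', 'S02'],
    'expression': [7.5, 9.0, 5.2, 4.4],
})

# equivalent of dcast(df, genes + organism ~ samples, value.var='expression')
tb = pandas.pivot_table(df, values='expression',
                        index=['genes', 'organism'], columns='samples')
print(tb)
```

Both identifier columns end up in the row index, and each sample becomes a column, just as with the dcast formula.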

Aggregation

Python’s pivot_table and R’s dcast functions can also be used to aggregate data, i.e. to reduce data dimensions using functions such as sum or mean. By default, dcast uses length (a count) and pivot_table uses the mean as the aggregation function, but this is up to you. If there is more than one row with the same identifier, the aggregation function is applied automatically. For example, in the following code, ignoring ‘samples’ in the pivoting step (i.e. asking for a single ‘expression’ column) results in more than one expression value per gene, so a mean is computed.

tab = pandas.pivot_table(df, values="expression", index=["genes"])

## Mean of gene in samples
#genes
#G1        7.959172
#G10       8.017683
#G100      8.199654
#G1000     8.059022
#...

df.loc[df.genes=='G1', ['expression']].mean() # checking that the output is the mean
#expression    7.959172
#dtype: float64
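The default can be overridden with pivot_table’s aggfunc argument. For example, to sum replicate values instead of averaging them (a sketch on a small made-up table):

```python
import pandas

# small made-up long table with repeated (gene, sample) rows
df = pandas.DataFrame({
    'genes':      ['BAD', 'BAD', 'TP53'],
    'samples':    ['S01', 'S01', 'S01'],
    'expression': [1.0, 2.0, 5.0],
})

# sum the replicates instead of averaging them
tb = pandas.pivot_table(df, values='expression', index='genes',
                        columns='samples', aggfunc='sum')
print(tb)
# BAD / S01 is 1.0 + 2.0 = 3.0 rather than the default mean of 1.5
```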

To use aggregation on non-numeric values, don’t forget to define your own aggregation function!
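For instance, one way to aggregate strings is to pass a small function of your own to aggfunc (the annotation column here is made up for illustration):

```python
import pandas

# hypothetical long table with a text annotation per (gene, sample) pair
df = pandas.DataFrame({
    'genes':      ['BAD', 'BAD', 'TP53'],
    'samples':    ['S01', 'S02', 'S01'],
    'annotation': ['apoptosis', 'signaling', 'tumor suppressor'],
})

# the default numeric aggregations would fail on strings,
# so join the values with a custom function instead
tb = pandas.pivot_table(df, values='annotation', index='genes',
                        aggfunc=lambda values: '; '.join(values))
print(tb)
```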

* If this step takes too much time in R, switch to Python (pandas is required here):

import random
import pandas

ngenes = 10000
nsamples = 1000

# gene labels cycle G0..G9999, repeated once per sample
genes = ['G%i' % i for i in range(ngenes)] * nsamples
# sample labels sorted into blocks, so each (gene, sample) pair appears once
samples = ['S%i' % i for i in range(nsamples)] * ngenes
samples.sort()
expression = [abs(random.gauss(1, 10)) for i in range(nsamples * ngenes)]

# list() is needed in Python 3, where zip returns an iterator
df = pandas.DataFrame(list(zip(genes, samples)), columns=['genes', 'samples'])
df['expression'] = expression
df.to_csv('longbig.txt', sep='\t')