python and pandas

python and pandas

R is undeniably a must-use language. Especially for data visualization. But R can sometimes be a little bit slow when dealing with big datasets. If you don’t need to create awesome graphs or don’t have time to wait, there’s an alternative in Python that can be quite fast for data manipulation. The Python Data Analysis Library, pandas, provides an easy way to manipulate data in python.

Recently, I had to deal with a big gene expression file (21024 genes x 3081 samples) derived from The Cancer Genome Atlas.

The RPKM were log-tranformed and I wanted to get back the untransformed (raw) data. The transformation applied to the data was : log10 (1000* RPKM +1). I was in a hurry so I decided to use pandas instead of R. :

Pandas offers two main data structures : Series and Dataframe. Series can be compared to R’s vectors and Dataframe can be compared to, well, R’s dataframe.

The package allows indexing and manipulation of the data and provides several functions whose purpose is to make our life easier. For example, dataframes can be created directly from dictionaries.

I’ve listed some basic functions implemented in pandas and their equivalent in R below.

With pandas With R Description
df.head() head(df) Display first lines
df.tail() tail(df) Display last lines
df.describe() summary(df) Get statistics
df.shape() dim(df) Get dimensions
df
[df['col']=='valueForRow']
df[df$col=='valueForRow',] Retrieve a given row
df.ix[0:2,1] df[0:2,1] Slicing
df.index rownames(df) Get row names
df.columns colnames(df) Get column names

For my gene expression data situation, this is what I did:


import pandas as pd   

df = pd.read_csv('TCGA.txt', sep='\t', dtype='object', header=0)

raw = (10**df[range(1,len(df.columns))].astype('float64')-1)/1000

raw_col = raw.columns
raw['Gene'] = df['Gene']
raw = raw[['Gene'] + list(raw_col)]

raw.to_csv('TCGA.raw.txt', sep='\t')

Loading the data on my computer took about 22.6 seconds, transforming it, 14.6 seconds and writing it back to a text file, 45.1 seconds.

You'll find more information in the pandas documentation . Have fun !

By | 2017-04-29T15:49:18+00:00 April 17, 2014|Categories: Data Analysis, Python|Tags: , |1 Comment

Share This Story, Choose Your Platform!

About the Author:

I’ve started in biochemistry but it is as a bioinformatician that I’ve been having fun for several years now : whether doing data analysis and visualization in R, building interactive web interfaces in javascript or exploring machine learning in python.

One Comment

  1. Patrick Gendron April 22, 2014 at 07:28 - Reply

    Pretty impressive speed!

    Especially if you consider that the equivalent awk script runs in as much as 70 seconds:

    awk ‘{for (i=1; i<=NF; i++) printf("%.2f\t", ((10**$i)-1)/1000); printf("\n")}'

Leave A Comment