Data Analysis


Say we have the following two groups:

    g1 <- c(55, 65, 58)
    g2 <- c(12, 18, 32)

We want to see whether the two groups come from the same distribution or can be considered different groups. We might be tempted to try a Student’s t-test.

    t.test(g1, g2)
    ## Welch Two Sample t-test
    ##
    ## data: g1 and g2
    ## t = 5.8366, df = 2.9412, p-value = 0.01059
    ## alternative hypothesis: true difference in means is not equal [...]
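As a rough cross-check of the t and df values reported by t.test for g1 and g2, here is a minimal Python sketch of the Welch statistic using only the standard library. The function name welch_t is made up for this illustration, and the p-value is left out because it needs the t distribution, which the standard library does not provide.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test (hypothetical helper)."""
    # Squared standard error of each group mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

g1 = [55, 65, 58]
g2 = [12, 18, 32]
t, df = welch_t(g1, g2)
print(f"t = {t:.4f}, df = {df:.4f}")
```

With only three values per group, the fractional degrees of freedom come from the Welch correction, which does not assume equal variances.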

2017-04-30T10:15:37+00:00 October 14, 2015 | Categories: Data Analysis, R, Statistics | 0 Comments

Working with large files

When dealing with Next Generation Sequencing data, I am routinely asked by clients how to open sequence files. The answer is that, given their huge size (often many millions of lines) and the resulting memory requirement, they should probably not be opened at all; they should only be processed. Most software designed to work with NGS data processes these files sequentially, as a stream, loading just the required amount of data from disk, processing it [...]
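To make the idea of streaming concrete, here is a small Python sketch that counts records in a sequence file while holding only one line in memory at a time. It assumes a FASTQ-like layout of four lines per record, which is an assumption of this example and not something stated in the post; count_records and the demo file are made up for illustration.

```python
import gzip
import os
import tempfile

def count_records(path, lines_per_record=4):
    """Count records by streaming the file line by line (hypothetical helper)."""
    # Transparently handle gzip-compressed files, which are common for large data.
    opener = gzip.open if path.endswith(".gz") else open
    n_lines = 0
    with opener(path, "rt") as handle:
        for _ in handle:  # the file object yields one line at a time
            n_lines += 1
    return n_lines // lines_per_record

# Tiny demo file standing in for a real, much larger sequence file.
with tempfile.NamedTemporaryFile("w", suffix=".fastq", delete=False) as tmp:
    tmp.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    demo_path = tmp.name

n_records = count_records(demo_path)
print(n_records)
os.remove(demo_path)
```

Memory use stays roughly constant regardless of file size, because the loop never builds a list of all lines.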

2017-04-30T10:19:35+00:00 October 1, 2015 | Categories: Data Analysis, Shell scripting | 1 Comment

Kaplan-Meier plot

When working with cancer datasets, one of the goals is sometimes to find features (mutations, clinical information, gene expression, ...) associated with prognosis, i.e. features related to the probable outcome of the disease. If that's one of your goals, you'll have to do a survival analysis. Survival analysis involves a set of methods to model the time at which an event of interest occurs, that event often being death. But really, any event for which the time of occurrence is [...]
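As a rough illustration of what a Kaplan-Meier curve computes, here is a minimal pure-Python sketch of the estimator. The data are invented for the example, and kaplan_meier is a hypothetical helper, not code from the post: at each event time, the running survival probability is multiplied by the fraction of at-risk subjects who did not have the event.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time."""
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    curve, surv = [], 1.0
    for t in event_times:
        # Subjects still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censored observations) occurring exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        surv *= 1 - events / at_risk
        curve.append((t, surv))
    return curve

durations = [5, 8, 8, 12, 15]  # made-up follow-up times
observed = [1, 1, 0, 1, 0]     # 1 = event seen, 0 = censored
km = kaplan_meier(durations, observed)
for time, prob in km:
    print(time, round(prob, 3))
```

Censored subjects count as at risk up to their last follow-up time but never trigger a drop in the curve.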

2017-04-29T17:14:26+00:00 February 19, 2015 | Categories: Data Analysis, Statistics | 0 Comments

python and pandas

R is undeniably a must-use language, especially for data visualization. But R can sometimes be a little slow when dealing with big datasets. If you don't need to create awesome graphs or don't have time to wait, there's an alternative in Python that can be quite fast for data manipulation. The Python Data Analysis Library, pandas, provides an easy way to manipulate data in Python. Recently, I had to deal with a big gene expression file (21024 genes x [...]

2017-04-29T15:49:18+00:00 April 17, 2014 | Categories: Data Analysis, Python | 1 Comment

lifelines (or doing survival analysis in Python)

Lately, I've been doing survival analysis. I'm not an expert, but we had a self-learning group based on David G. Kleinbaum and Mitchel Klein’s book, "Survival Analysis. A Self-Learning Text". At the end of this book, there's code provided to help you get started in SAS, Stata, SPSS and... R! I've played with the R package survival, which is quite good! My problem was that I wanted to do survival analysis in Python. I started by doing it with [...]

2017-04-29T17:16:41+00:00 March 24, 2014 | Categories: Data Analysis, Python, Statistics | 0 Comments