Data Analysis

Grep parameters every bioinformatician should know

Your shell, along with the myriad command line programs it exposes is clearly a great friend when it comes to file manipulation. And let's face it, file manipulation is a big part of a bioinformatician's daily workload. Now, since we rarely have the time to review all the options offered by the different programs I thought I'd list some really useful ones from grep. I expect everyone to know what grep is and what it does so let's just get [...]

By | 2017-04-29T15:35:48+00:00 November 27, 2015|Categories: Data Analysis, Shell scripting|Tags: , |0 Comments

Applying PCA to Leucegene data

GEO offers an extremely rich source of transcriptional profile data, but downloading and preparing a dataset is often an obstacle to aspiring bioinformaticians. I'll walk you through one way to do it using the Leucegene dataset as an example. Once this data is loaded and ready to use in R, I'll then present a very simplified and practical perspective on the use of PCA for exploratory analysis. Loading data A dataset of 285 transcriptional profiles of acute myeloid leukemia (AML) [...]

By | 2017-04-29T23:05:21+00:00 November 17, 2015|Categories: Data Analysis, R|Tags: , |0 Comments


Say we have the two following groups : g1 <- c(55, 65, 58) g2 <- c(12, 18, 32) We want to see if the two groups belong to the same distribution or can be considered as different groups. We might be tempted to try a Student’s t-test. t.test(g1, g2) ## Welch Two Sample t-test ## ## data: g1 and g2 ## t = 5.8366, df = 2.9412, p-value = 0.01059 ## alternative hypothesis: true difference in means is not equal [...]

By | 2017-04-30T10:15:37+00:00 October 14, 2015|Categories: Data Analysis, R, Statistics|0 Comments

Working with large files

When dealing with Next Generation Sequencing data, I am routinely asked by clients how to open sequence files. The answer is that given their huge size (often many million lines) and the consequent requirement in memory, they should probably not be opened in any way, they should only be processed. Most software designed to work with NGS data will then process these files in a sequential fashion or stream, loading just the required amount of data from disk, processing it [...]

By | 2017-04-30T10:19:35+00:00 October 1, 2015|Categories: Data Analysis, Shell scripting|Tags: , |2 Comments

Kaplan-Meier plot

When working with cancer datasets, one of the goal is sometimes to find features (mutation, clinical information, gene expression, ...) associated to prognosis, i.e. features related to the probable outcome of the disease. If that's one of your goal, you'll have to do a survival analysis.  Survival analysis involves a set of methods to model the time at which an event of interest occurs, that event often being death.  But really, any event for which the time of occurence is [...]

By | 2017-04-29T17:14:26+00:00 February 19, 2015|Categories: Data Analysis, Statistics|Tags: |0 Comments