Statistics

Bootstraps and Confidence Intervals

When analyzing data, you might want or need to fit a specific curve to a particular dataset. This type of analysis can result in instructive outputs regarding the relationship between two (or more) quantifiable parameters. The main objective of this post is not how to implement such fitting, but rather how to display the goodness of such a fit, i.e. how to calculate a confidence interval around a fitted curve. That being said, I will show how to do curve fitting in [...]
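One common way to get a confidence interval around a fitted curve is a case-resampling bootstrap with a pointwise percentile interval. The following is a minimal sketch under stated assumptions, not the code from the post (which is truncated here): the data are simulated, the quadratic model and the names `dat`, `grid`, `n_boot` and `ci` are purely illustrative.

```r
# Hypothetical simulated data: a noisy quadratic relationship (not from the post).
set.seed(42)
n <- 50
x <- seq(0, 10, length.out = n)
y <- 2 + 0.5 * x - 0.03 * x^2 + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Grid on which the fitted curve and its interval are evaluated.
grid <- data.frame(x = seq(0, 10, length.out = 100))

# Fit on the original data.
fit <- lm(y ~ x + I(x^2), data = dat)
fitted_curve <- predict(fit, newdata = grid)

# Case-resampling bootstrap: refit on rows drawn with replacement.
n_boot <- 500
boot_curves <- replicate(n_boot, {
  idx <- sample(nrow(dat), replace = TRUE)
  boot_fit <- lm(y ~ x + I(x^2), data = dat[idx, ])
  predict(boot_fit, newdata = grid)
})

# boot_curves has one row per grid point and one column per bootstrap fit.
# Pointwise 95% percentile interval: row 1 = lower bound, row 2 = upper bound.
ci <- apply(boot_curves, 1, quantile, probs = c(0.025, 0.975))
```

The two rows of `ci` can then be drawn around `fitted_curve` with `lines(grid$x, ci[1, ])` and `lines(grid$x, ci[2, ])`. The interval is pointwise, so it describes the uncertainty at each `x` separately rather than for the curve as a whole.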

2017-04-29T18:33:55+00:00 | September 29, 2016 | Categories: Data Analysis, R, Statistics | 1 Comment

Fastest method to compute an AUC

Context: AUC is an acronym for "Area Under the (ROC) Curve". If you are not familiar with the ROC curve and AUC, I suggest reading this blog post before continuing further. For several projects, I needed to compute a large number of AUCs. It started with 25,000, increased to 230,000 and now I need to compute 1,500,000 AUCs. With so many AUCs, the time to compute each one becomes critical. On the web, I didn't find much information about this specific [...]
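For reference, the AUC can be computed from ranks alone, because it equals the Mann-Whitney U statistic divided by the number of positive/negative pairs. The sketch below illustrates that identity in base R; it is not claimed to be the fastest method the post arrives at, and the function name `auc_rank` and the toy data are hypothetical.

```r
# AUC from the Mann-Whitney U statistic.
# `scores`: numeric predictions; `labels`: logical, TRUE for the positive class.
auc_rank <- function(scores, labels) {
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  r <- rank(scores)  # ties receive average ranks by default
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical toy data.
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(FALSE, FALSE, TRUE, TRUE)
auc_rank(scores, labels)  # 0.75
```

The cost is dominated by the sort inside `rank()`, so no ROC curve has to be built explicitly.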

2017-04-29T16:56:33+00:00 | August 18, 2016 | Categories: Performance, Python, R, Statistics | 0 Comments

Standard deviation on a correlation scatter plot

I was recently asked by a colleague to provide a visualization of differential gene expression computed using RPKM values (two samples, no replicates) and to highlight genes that were outside the distribution by 2 standard deviations or more. As a first draft, I quickly obliged by calculating the fold change distribution, computing its standard deviation and drawing lines on either side of the diagonal. This turns out to be equivalent to computing the standard deviation of the residual of a linear [...]
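The first-draft approach described above could look roughly like the sketch below. It is an illustration under assumptions, not the post's code: the expression values are simulated, they are assumed to be on a log2 scale so that the fold change is a simple difference, and the names `s1`, `s2`, `lfc` and `outlier` are hypothetical.

```r
# Hypothetical simulated log2(RPKM) values for two samples (not the post's data).
set.seed(1)
n_genes <- 1000
s1 <- rnorm(n_genes, mean = 5, sd = 2)
s2 <- s1 + rnorm(n_genes, sd = 0.5)

# Log fold change per gene and its standard deviation.
lfc <- s2 - s1
sd_lfc <- sd(lfc)

# Genes lying 2 standard deviations or more from the mean fold change.
outlier <- abs(lfc - mean(lfc)) >= 2 * sd_lfc

# Scatter plot with lines parallel to the diagonal (drawn only in an interactive session).
if (interactive()) {
  plot(s1, s2, pch = 16, cex = 0.5, col = ifelse(outlier, "red", "grey50"))
  abline(a = mean(lfc), b = 1)
  abline(a = mean(lfc) + 2 * sd_lfc, b = 1, lty = 2)
  abline(a = mean(lfc) - 2 * sd_lfc, b = 1, lty = 2)
}
```

The dashed lines have slope 1, so a gene's vertical distance from the central line is its fold change relative to the mean.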

2017-04-29T17:05:35+00:00 | April 5, 2016 | Categories: Data Visualization, R, Statistics | 0 Comments

Permutations

Say we have the following two groups:

    g1 <- c(55, 65, 58)
    g2 <- c(12, 18, 32)

We want to see if the two groups belong to the same distribution or can be considered as different groups. We might be tempted to try a Student’s t-test.

    t.test(g1, g2)
    ## Welch Two Sample t-test
    ##
    ## data: g1 and g2
    ## t = 5.8366, df = 2.9412, p-value = 0.01059
    ## alternative hypothesis: true difference in means is not equal [...]
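With only six values, a permutation test can enumerate every possible regrouping exactly. The sketch below uses the difference in means as the test statistic, which is an assumption since the rest of the post is truncated here; the names `pooled`, `splits` and `perm_diffs` are illustrative.

```r
g1 <- c(55, 65, 58)
g2 <- c(12, 18, 32)
pooled <- c(g1, g2)

# Observed difference in means.
obs <- mean(g1) - mean(g2)

# Every way of assigning 3 of the 6 values to the first group: choose(6, 3) = 20.
splits <- combn(length(pooled), length(g1))
perm_diffs <- apply(splits, 2, function(idx) {
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided p-value: share of regroupings at least as extreme as the observed one.
# The small tolerance guards against floating-point round-off.
p_value <- mean(abs(perm_diffs) >= abs(obs) - 1e-9)
p_value  # 0.1
```

Only the observed split and its mirror image reach the observed difference, so the two-sided p-value is 2/20 = 0.1. With three values per group, an exact two-sided permutation test cannot produce anything smaller.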

2017-04-30T10:15:37+00:00 | October 14, 2015 | Categories: Data Analysis, R, Statistics | 0 Comments

Don’t ignore the warnings!

I'm sure that all of you R users have now noticed that sometimes R is talking to you. When you do something wrong, R replies with a message written in red in the console. How many of you actually read those error messages? If you take the time to read them carefully, you'll get a hint about what was wrong in your command. Let's look at an example:

    > sum(c('1','3','4','4'))
    Error in sum(c("1", "3", "4", "4")) : invalid 'type' (character) [...]
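In this example the message names the problem: the values are of type character, not numeric. A minimal sketch of reading the message programmatically and then acting on the hint follows; the variable names `x`, `msg` and `total` are illustrative.

```r
x <- c('1', '3', '4', '4')

# Capture the error message as a string instead of stopping the script.
msg <- tryCatch(sum(x), error = function(e) conditionMessage(e))

# The message complains about the type, so convert before summing.
total <- sum(as.numeric(x))
total  # 12
```

Note that `as.numeric()` itself emits a warning and returns `NA` for strings that cannot be parsed as numbers, which is another message worth reading.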

2017-04-30T16:25:19+00:00 | September 3, 2015 | Categories: R, Statistics | 0 Comments