Assessing enrichment

Working on a set of RNA-seq data from AML patient samples, I stumble on gene X. When its expression is high, 50% of the samples carry a mutation on gene Y, a mutation that has a prevalence of only 20% in the rest of the dataset. Is there a link between these two observations? Let’s put some numbers on this: among the 131 samples of the dataset, 28 carry a mutation on gene Y, 6 have high expression of X and 3 have both “features”. These counts can be laid out in what is called a 2×2 contingency table:

	Mutation on Y	No mutation on Y	Total
High expression of X	3	3	6
Low expression of X	25	100	125
Total	28	103	131
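
The four cells of the table follow from the marginal counts quoted in the text. Here is a minimal sketch of that arithmetic in R; the variable names and the dimension labels are mine, chosen only for illustration:

```r
# Counts quoted in the text
n_total <- 131  # samples in the dataset
n_mut   <- 28   # samples with a mutation on gene Y
n_high  <- 6    # samples with high expression of X
n_both  <- 3    # samples with both features

# Fill the 2x2 table row by row: high expression first, then low expression
tab <- matrix(
  c(n_both,         n_high - n_both,
    n_mut - n_both, n_total - n_high - (n_mut - n_both)),
  nrow = 2, byrow = TRUE,
  dimnames = list(Expression = c("High X", "Low X"),
                  Mutation   = c("Mutated Y", "Not mutated Y"))
)

# Display the table with its row and column totals
addmargins(as.table(tab))
```

The same `tab` object can be passed directly to `fisher.test()` instead of retyping the four counts.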

The catch is that only 6 samples show high expression of X. Suppose it is pure coincidence that our dataset contains a few more individuals with both features than expected. How often would a coincidence at least this large occur if the frequency of the mutation were the same in the groups with high and low expression of X? This probability is conveniently computed with Fisher’s exact test, here in R:

> fisher.test (matrix (c(3, 3, 25, 100), nrow=2, byrow=T))

Fisher's Exact Test for Count Data

data: matrix(c(3, 3, 25, 100), nrow = 2, byrow = T)
p-value = 0.1116
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.4982704 31.2702870
sample estimates:
odds ratio
3.944561

Two numbers are of interest in this output. The p-value is 0.1116: if there were no real difference between the two groups, there would be an 11% chance of seeing a discrepancy at least this large. The odds ratio estimates the magnitude of the enrichment. So, although the odds of carrying the mutation are about 4 times higher in the group with high expression of X (1:1 vs. 1:4), we can’t rule out the possibility of a coincidence: the 95% confidence interval for the odds ratio, from 0.50 to 31.27, includes 1.
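
To see where the p-value comes from, it can be recomputed by hand. With the row and column totals fixed, the number of mutated samples among the 6 high-expression samples follows a hypergeometric distribution under the null hypothesis. The sketch below is my own illustration of that idea, not part of the test output; the small tolerance factor is there to absorb floating-point ties.

```r
# 28 mutated and 103 non-mutated samples, of which 6 have high expression of X.
# Probability of each possible number of mutated samples among those 6:
k <- 0:6
p <- dhyper(k, m = 28, n = 103, k = 6)

# Probability of the observed table (3 mutated samples among the 6)
p_obs <- dhyper(3, m = 28, n = 103, k = 6)

# Two-sided p-value: total probability of all tables that are
# no more likely than the observed one
p_two_sided <- sum(p[p <= p_obs * (1 + 1e-7)])
round(p_two_sided, 4)  # should agree with the p-value reported by fisher.test

# Sample odds ratio: odds of mutation with high X over odds with low X
(3 / 3) / (25 / 100)
```

The sample odds ratio is exactly 4, while `fisher.test` reports 3.944561. The difference arises because R reports a conditional maximum likelihood estimate rather than the plain ratio of the cross-products.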

Be careful: these data do not support an absence of difference either. In fact, we can’t even confirm that the odds ratio (enrichment) is less than 4:

> fisher.test (matrix (c(3, 3, 25, 100), nrow=2, byrow=T), or=4, alt="less")

Fisher's Exact Test for Count Data

data: matrix(c(3, 3, 25, 100), nrow = 2, byrow = T)
p-value = 0.6555
alternative hypothesis: true odds ratio is less than 4
95 percent confidence interval:
0.00000 23.04848
sample estimates:
odds ratio
3.944561
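
The two-sided confidence interval from the first test tells the same story in a single step. As a rough sketch (the object names `tab` and `res` are mine), the result of `fisher.test` can be stored and its components inspected directly:

```r
tab <- matrix(c(3, 3, 25, 100), nrow = 2, byrow = TRUE)
res <- fisher.test(tab)

res$p.value   # two-sided p-value
res$estimate  # conditional estimate of the odds ratio
res$conf.int  # two-sided 95% confidence interval for the odds ratio

# Both "no enrichment" (odds ratio of 1) and a 4-fold enrichment
# lie inside the interval, so neither can be excluded
covers_1_and_4 <- res$conf.int[1] < 1 && res$conf.int[2] > 4
covers_1_and_4
```

With only 6 samples in the high-expression group, the interval is simply too wide to separate the two scenarios.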

So much for the saying that statistics can tell any story… here, they are stubbornly quiet!