Standard deviation on a correlation scatter plot

I was recently asked by a colleague to provide a visualization of differential gene expression computed from RPKM values (two samples, no replicates) and to highlight genes lying 2 standard deviations or more outside the distribution. As a first draft, I quickly obliged by calculating the fold change distribution, computing its standard deviation and drawing lines on either side of the diagonal to obtain:

[Figure: RPKM_1]

This turns out to be nearly equivalent to computing the standard deviation of the residuals of a linear model fitted to the data. In R (with ~= meaning "approximately equal"):

sd(residuals(lm(y ~ x + 0))) ~= sd(log(FC))
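
To see this numerically, here is a minimal sketch on simulated data. The log2 expression values, the noise level and all variable names are assumptions made for illustration; they are not the data shown in this post.

```r
set.seed(1)
n <- 5000
x <- runif(n, 0, 10)             # simulated log2(RPKM), sample 1
y <- x + rnorm(n, sd = 0.5)      # simulated log2(RPKM), sample 2

log_fc <- y - x                  # log2 fold change
fit <- lm(y ~ x + 0)             # linear model forced through the origin

sd_resid <- sd(residuals(fit))   # SD of the residuals
sd_logfc <- sd(log_fc)           # SD of the log fold change

# Genes lying 2 SD or more away from the centre of the fold change distribution
outlier <- abs(log_fc - mean(log_fc)) >= 2 * sd_logfc
```

When the fitted slope is close to 1, the residuals are almost exactly the log fold changes, which is why sd_resid and sd_logfc should come out nearly identical.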

My colleague was quick to point out that this was less than satisfying and that (s)he was expecting the standard deviation to vary along the diagonal. That makes perfect sense, although after many minutes of searching, no answer was in sight. Eventually, we obtained replicates for this project and I then performed a full-blown differential expression analysis using DESeq2, in which I used the adjusted p-value to color significant genes (p-adj < 0.001):

[Figure: DESEQ2]

Fast forward a few weeks and the answer suddenly appeared out of nowhere in the form of simple geometry, so I ended up testing the following. First, fit a linear model to the data to obtain the slope (m) of the model, which should incidentally be close to 1 if the data are correctly normalized. In my case, I obtained 0.986, not bad at all. Then, rotate the data clockwise by the angle corresponding to that slope, atan(m), which amounts to a rotation by Theta = -atan(m). This is simply done by applying a rotation matrix to each coordinate (x,y).
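
For reference, assuming the usual convention where a positive angle rotates counterclockwise, the rotation can be written as follows (x' and y' denote the rotated coordinates):

```latex
\begin{pmatrix} x' \\ y' \end{pmatrix}
=
\begin{pmatrix}
\cos\Theta & -\sin\Theta \\
\sin\Theta & \cos\Theta
\end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix},
\qquad \Theta = -\arctan(m)
```

With Theta = -atan(m) this expands to x' = x cos(atan(m)) + y sin(atan(m)) and y' = -x sin(atan(m)) + y cos(atan(m)), so a point on the fitted line y = m x lands on the x' axis. If the fit has a non-zero intercept, it would need to be subtracted from y before rotating.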

We can then compute a windowed standard deviation of the y’ values along the x’ axis and transform these values back, as two curves, using a rotation of -Theta. This yields:

[Figure: RPKM_2]

A keen observer will note that this is almost the same as computing a windowed standard deviation curve of the M values of an MA plot along the A axis. It would in fact be the same, up to a constant scale factor, if the rotation angle were exactly pi/4. Also, the standard deviation threshold alone would probably not be an excellent criterion for identifying interesting genes, as it would tend to select many lowly expressed ones. A limit on expression level and fold change remains our best option when working without replicates.
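
The relation to the MA plot can be checked with a small sketch. It assumes simulated log2 values and the usual definitions M = y - x and A = (x + y) / 2; the variable names are illustrative only. After a clockwise rotation by pi/4, y' equals M / sqrt(2) and x' equals sqrt(2) * A.

```r
set.seed(2)
x <- runif(1000, 0, 10)          # simulated log2 expression, sample 1
y <- x + rnorm(1000, sd = 0.5)   # simulated log2 expression, sample 2

M <- y - x                       # log ratio (MA plot, vertical axis)
A <- (x + y) / 2                 # mean log expression (MA plot, horizontal axis)

theta <- pi / 4                  # clockwise rotation angle for a slope of exactly 1
xp <-  cos(theta) * x + sin(theta) * y
yp <- -sin(theta) * x + cos(theta) * y

same_M <- isTRUE(all.equal(yp, M / sqrt(2)))   # y' is M up to a factor 1/sqrt(2)
same_A <- isTRUE(all.equal(xp, A * sqrt(2)))   # x' is A up to a factor sqrt(2)
```

Because both axes differ only by constant factors, a windowed standard deviation of y' along x' has the same shape as one of M along A, provided the window width is rescaled accordingly.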

About the Author: Patrick

Former physicist turned structural biologist and software developer, he now manages a team of talented bioinformaticians at the platform. Racing god and master barista aside, he expertly handles the next-gen sequencing analysis service and IT infrastructure. Legend has it that he was roaming the future site of IRIC prior to its foundation. He also wrote this entire bio himself in the third person.

3 Comments

Amy January 9, 2020 at 16:06 - Reply

could you share the R code for this?
Ishwarya Murali April 26, 2020 at 18:04 - Reply

Hi,
This is very helpful for doing pairwise comparisons. Could you please share the R code for this?

Patrick May 8, 2020 at 10:23 - Reply

Here is the code for the last graph:


library(ggplot2)

scatterPlot <- function(x, y, standard_dev = 2, window = 1) {
    ## Fit y ~ x; the slope gives the rotation angle
    mod <- lm(y ~ x)
    slope <- unname(coef(mod)["x"])
    intercept <- unname(coef(mod)["(Intercept)"])
    if (is.na(intercept)) intercept <- 0
    angle <- atan(slope)

    ## Subtract the intercept, then rotate clockwise by theta (vectorized).
    ## Returns a 2-row matrix: row 1 = x', row 2 = y'
    rotate <- function(x, y, theta, intercept) {
        rbind(cos(theta) * x + sin(theta) * (y - intercept),
              -sin(theta) * x + cos(theta) * (y - intercept))
    }

    ## Exact inverse of rotate(): rotate counterclockwise by theta,
    ## then add the intercept back
    unrotate <- function(xp, yp, theta, intercept) {
        rbind(cos(theta) * xp - sin(theta) * yp,
              sin(theta) * xp + cos(theta) * yp + intercept)
    }

    ## Rotate so that slope is 0
    rot <- rotate(x, y, angle, intercept)
    xp <- rot[1, ]
    yp <- rot[2, ]

    ## Compute standard deviation along the x axis at position pos,
    ## for data in a given window (a lower window will be less smooth).
    ## Returns NA when the window holds fewer than two points.
    sdWindow <- function(pos, window, x, y) {
        sel <- x >= pos - window / 2 & x <= pos + window / 2
        sd(y[sel])
    }

    xsd <- seq(min(xp), max(xp), length.out = 100)
    ysd <- sapply(xsd, sdWindow, window, xp, yp)

    ## Transform the + and - standard deviation curves back to the original axes
    sdp <- unrotate(xsd, standard_dev * ysd, angle, intercept)
    sdm <- unrotate(xsd, -standard_dev * ysd, angle, intercept)

    sdpdf <- data.frame(x = sdp[1, ], y = sdp[2, ])
    sdmdf <- data.frame(x = sdm[1, ], y = sdm[2, ])

    ## Points in grey, fitted line in blue, +/- standard deviation curves in red
    df <- data.frame(x = x, y = y)
    ggplot(df, aes(x = x, y = y)) +
        geom_point(shape = 20, colour = "#dddddd") +
        geom_abline(intercept = intercept, slope = slope, color = "#4795c3", linetype = 2) +
        geom_line(data = sdpdf, aes(x = x, y = y), color = "#f93a3a", linetype = 2) +
        geom_line(data = sdmdf, aes(x = x, y = y), color = "#f93a3a", linetype = 2) +
        theme_bw() +
        theme(plot.title = element_text(lineheight = .8, face = "bold"), legend.position = "none")
}
