Data Analysis

Understanding how kallisto works

In 2016, Bray et al. introduced a new k-mer-based method to estimate isoform abundance from RNA-Seq data. Their method, called kallisto, provided a significant improvement in speed and memory usage over earlier methods while yielding similar accuracy. In fact, kallisto can quantify expression in a matter of minutes instead of hours. Because it is so lightweight and convenient, kallisto is now often used to quantify expression as transcripts per million (TPM). But how does [...]

By | 2018-04-08T15:01:03+00:00 March 28, 2018|Categories: Bioinformatics, Data Analysis|1 Comment
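For readers unfamiliar with the TPM unit mentioned in the excerpt above, here is a minimal sketch of the generic TPM normalisation. It is not kallisto's implementation: kallisto estimates counts and effective lengths itself, whereas the function name `tpm` and the toy counts and lengths below are made up for illustration.

```python
def tpm(counts, lengths):
    """Convert per-transcript read counts to transcripts per million.

    counts  -- reads (or estimated reads) assigned to each transcript
    lengths -- transcript lengths in bases (tools typically use effective lengths)
    """
    # Reads per base: corrects for longer transcripts attracting more reads.
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    # Rescale so that the values sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Hypothetical toy data: three transcripts.
toy_counts = [100, 200, 300]
toy_lengths = [1000, 2000, 1000]
toy_tpm = tpm(toy_counts, toy_lengths)
print(toy_tpm)  # approximately [200000, 200000, 600000]
```

Note that the first two transcripts end up with the same TPM even though the second has twice the reads, because it is also twice as long.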

Overfitting and Regularization

This series of articles on machine learning wouldn't be complete without dipping our toes into overfitting and regularization. Overfitting: The Achilles' heel of machine learning is overfitting. As machine learning techniques become more powerful (with larger numbers of parameters), exposure to overfitting increases. An overfit model violates Occam's razor: it becomes so complex that it begins to memorise small, unimportant details of the training set that have no true link to the target. [...]

By | 2017-10-30T12:54:46+00:00 October 30, 2017|Categories: Data Analysis, Machine learning, Uncategorized|0 Comments
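To make the "memorising the training set" idea from the excerpt above concrete, here is a small sketch on contrived toy data. Everything in it is an assumption for illustration: the underlying relationship (y = 2x), the hand-picked noise values, and the two hypothetical models (`fit_memoriser`, which stores every training point, and `fit_line`, a least-squares straight line). It is not taken from the article itself.

```python
def fit_memoriser(xs, ys):
    """An extreme overfit: store every training pair and answer
    with the target of the nearest stored x."""
    table = dict(zip(xs, ys))

    def predict(x):
        nearest = min(table, key=lambda known_x: abs(known_x - x))
        return table[nearest]

    return predict


def fit_line(xs, ys):
    """A simple model: least-squares straight line (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(x):
        return slope * x + intercept

    return predict


def mse(predict, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: y = 2x plus hand-picked noise (fixed so the example is deterministic).
train_xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_ys = [2 * x + e for x, e in zip(train_xs, [1, -1, 1, -1, 1, -1])]
test_xs = [0.4, 1.4, 2.4, 3.4, 4.4, 5.4]
test_ys = [2 * x + e for x, e in zip(test_xs, [-1, 1, -1, 1, -1, 1])]

memoriser = fit_memoriser(train_xs, train_ys)
line = fit_line(train_xs, train_ys)

for name, model in [("memoriser", memoriser), ("line", line)]:
    print(name,
          "train MSE:", mse(model, train_xs, train_ys),
          "test MSE:", mse(model, test_xs, test_ys))
```

The memoriser is perfect on the training set because it has stored the noise along with the signal, and that is exactly why it does worse than the straight line on unseen points.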

Big data, big challenge – part 2

This post follows my previous post on big data. Even though that post did not spark much online discussion, I was pleased to read some comments about the situation in other areas of bioinformatics. Proteomics: Mathieu Courcelles, bioinformatician at the proteomics platform, explained that mass spectrometry-driven proteomics has always generated 'big data', so the expression is not used in the field. As he said, mass spectrometers are indeed instruments that generate a large volume of data 24/7. Early on [...]

By | 2017-08-18T13:24:34+00:00 August 18, 2017|Categories: Data Analysis|0 Comments

Gradient Descent

Gradient descent is an iterative algorithm that searches for the parameter values of a function of interest that minimize a cost function over a given dataset. It is often used in machine learning to quickly find an approximate solution to complex, multi-variable problems. In my last article, Introduction to Linear Regression, I mentioned gradient descent as one way to solve simple linear regression. While there exists an optimal analytical solution to simple [...]

By | 2017-08-03T16:23:44+00:00 August 3, 2017|Categories: Data Analysis, Machine learning, Python, Uncategorized|0 Comments
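As a rough illustration of the algorithm described in the excerpt above, here is a minimal sketch of gradient descent applied to simple linear regression with a mean squared error cost. The function name, the learning rate, the iteration count and the toy data are all assumptions chosen for this example, not the code from the article.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iterations=2000):
    """Fit y = slope * x + intercept by iteratively reducing the
    mean squared error (MSE) over the dataset."""
    slope, intercept = 0.0, 0.0  # arbitrary starting point
    n = len(xs)
    for _ in range(n_iterations):
        # Prediction errors of the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error ** 2).
        grad_slope = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2.0 / n) * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = gradient_descent(xs, ys)
print(slope, intercept)  # close to 2 and 1
```

The learning rate matters: too large and the updates overshoot and diverge, too small and convergence takes many more iterations. The values used here happen to work for this toy dataset and would need tuning for other data.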

R or Python, you choose!

Updated 27/08/2018. I have already briefly introduced pandas, a Python library, by comparing some of its functions to their equivalents in R. Pandas makes Python almost as convenient as R for visualizing and exploring data held in matrices and data frames (it is built on top of numpy). It has evolved a lot over the past few years, as has its community of users. Although pandas is being integrated into a number of specialized packages, such as rdkit [...]

By | 2018-08-28T10:18:53+00:00 June 26, 2017|Categories: Data Analysis, Python, R|1 Comment