Python

Fastest method to compute an AUC

Context: AUC is an acronym for "Area Under the (ROC) Curve". If you are not familiar with the ROC curve and AUC, I suggest reading this blog post before to continuing further. For several projects, I needed to compute a large number of AUC. It started with 25,000, increased to 230,000 and now I need to compute 1,500,000 AUC. With so many AUC, the time to compute each one becomes critical. On the web, I don't find much information about this specific [...]

By | August 18, 2016|Categories: Data Analysis, Performance, Python, R, Statistics|0 Comments

SciPy and Logistic Regressions

Given a set of data points, we often want to see if there exists a satisfying relationship between them. Linear regressions can easily be visualized with Seaborn, a Python library that is meant for exploration and visualization rather than statistical analysis. As for logistic regressions, SciPy is a good tool when one does not have his or her own analysis script. Let's look at the optimize package                        from scipy.optimize import [...]

By | June 9, 2016|Categories: Bioinformatics, Data Analysis, Data Visualization, Python|0 Comments

Parallelize your Python !

This article will teach you what are multithreads, multicores, and in what circumstances each can be used. Your nerd friend keeps telling you about his professional deformation all the time? Wanting to parallelize and optimize his time? Do you wish to understand it as well and save time by parallelizing your programs in Python? Then this article is what you need! You will be able to gain big amounts of time, thanks to a small dose of parallelism [...]

By | March 29, 2016|Categories: Bioinformatics, Performance, Python|Tags: |0 Comments

Factorial and Log Factorial

Factorial: When you need to calculate n!, you have several solutions.  The "rush" solution: using a loop or a recursive function:  def factorial_for(n): r = 1 for i in range(2, n + 1): r *= i return(r) def factorial_rec(n): if n > 1: return(n * factorial_rec(n - 1)) else: return(1) Here, the multiplication of the numbers sequentially will create a huge number very quickly. This is good, but computers are faster when 2 small numbers (120x30240) are involved in a multiplication versus the [...]

By | February 22, 2016|Categories: Bioinformatics, Data Analysis, Performance, Python|0 Comments

Generating Synthetic Genomic Data

Applying statistical methods is a large part of the work of a bioinformatician. Apart from some more classical techniques, machine learning algorithms are also regularly applied to clinical and biological data (notably, clustering techniques such as k-means). Some techniques such as artificial neural networks have recently found great success in areas such as image recognition and natural language processing. However, these techniques do not perform as well on small datasets with high dimensionality, a problem known as "the curse of dimensionality". [...]

By | January 7, 2016|Categories: Bioinformatics, Data Analysis, Python|0 Comments