big data

Big data, big challenge – part 2

This post follows my previous post on big data. Even though that post did not spark a big online discussion, I was pleased to read some comments about the situation in other areas of bioinformatics. Proteomics: Mathieu Courcelles, bioinformatician at the proteomics platform, explained that mass-spectrometry-driven proteomics has always generated 'big data', so the expression is not used in the field. As he said, mass spectrometers are indeed instruments that generate a large volume of data 24/7. Early on [...]

August 18, 2017 | Categories: Data Analysis | 0 Comments

Big data, big challenge

You've probably heard the expression "Big Data" before, particularly if you have read Simon Mathien's blog post on IRIC's website. (If you haven't read it yet, you should do so now!) There are several definitions (or interpretations) of this expression, which are best summarized by the following two: "Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data" (Oxford English Dictionary). Technological field dedicated [...]

April 24, 2017 | Categories: Data Analysis | 3 Comments

Working with large files

When dealing with Next Generation Sequencing (NGS) data, I am routinely asked by clients how to open sequence files. The answer is that, given their huge size (often many millions of lines) and the resulting memory requirement, they should probably not be opened at all; they should only be processed. Most software designed to work with NGS data processes these files sequentially, as a stream, loading just the required amount of data from disk, processing it [...]

October 1, 2015 | Categories: Data Analysis, Shell scripting | 1 Comment
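
As a rough illustration of the streaming idea in the excerpt above, here is a minimal Python sketch that reads a sequence file one line at a time instead of loading it whole. It is not taken from the original post: the function names are hypothetical, and the record count assumes the common four-lines-per-record FASTQ layout, which may not hold for every file.

```python
import gzip
from typing import Iterator, TextIO


def open_text(path: str) -> TextIO:
    """Open a plain or gzip-compressed text file for lazy reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def stream_lines(path: str) -> Iterator[str]:
    """Yield one line at a time; only the current line is kept in memory."""
    with open_text(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_fastq_records(path: str) -> int:
    """Count records, assuming 4 lines per record (an assumption, see above)."""
    n_lines = sum(1 for _ in stream_lines(path))
    return n_lines // 4
```

Because `stream_lines` is a generator, memory use stays roughly constant no matter how many millions of lines the file contains; any per-line processing can be chained onto it in the same way.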