You’ve probably heard the expression “Big Data” before, particularly if you’ve read Simon Mathien’s blog post on IRIC’s website. (If you haven’t read it yet, you should do so now!)

There exist several definitions (or interpretations) of this expression, which are best summarized by the following two:

Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data

Oxford English Dictionary

A technological domain dedicated to the analysis of very large volumes of computer data (petabytes) coming from a wide variety of sources, such as search engines and social networks; also, these large volumes of data themselves. (Official French recommendation: mégadonnées.)

French Larousse

The Oxford English Dictionary’s definition presents the idea of a challenging dataset, whereas the French Larousse’s definition focuses on the domain dedicated to the analysis of large amounts of data.

I think the Oxford Dictionary’s definition better describes the current state of big data in biomedical research, since we are not yet able to benefit from all the data available worldwide (or even from a few different medical centers).

Challenges arise at different levels when working with big data. Apart from the obvious and important technical issues related to data size and the limitations of our resources (bandwidth, memory, disk storage), big data also comes with “logistical” considerations. Those include finding datasets of interest, downloading and formatting them in the right way, retrieving and understanding their metadata, making the different datasets compatible, and integrating all of them together.

I’m not sure how it goes in other areas of bioinformatics, but in genomics, more precisely in the analysis of gene expression, those “logistical” considerations are not trivial. I plan on looking at those other areas in the future, but for now, I’ll concentrate on gene expression.

On finding and downloading the data

Several efforts have been made to ease data accessibility (and often to allow its analysis online). For example, the Gene Expression Omnibus (GEO), a repository for gene expression data generated from microarrays and RNA-Seq, standardizes file formats and offers metadata (which is mandatory upon deposition). Metadata (for example patient characteristics, experimental conditions, aim of the study, protocols used, data preprocessing and transformation, etc.) are important. Without this precious metadata, datasets are quite useless because you don’t know what you’re working with. ArrayExpress is another resource for data deposition.
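To give a concrete idea of what retrieving a dataset and its metadata can look like, here is a minimal sketch using the GEOparse Python package. The accession number is just an illustrative public series, and the exact metadata keys vary from study to study, so treat this as a starting point rather than a recipe.

# Minimal sketch: fetch a GEO series and inspect its metadata with GEOparse
# (assumes the package is installed, e.g. "pip install GEOparse").
import GEOparse

# The accession below is only an example of a public series.
gse = GEOparse.get_GEO(geo="GSE2034", destdir="./geo_cache")

# Series-level metadata: study title, summary, protocols, etc.
print(gse.metadata["title"])

# Per-sample metadata: patient characteristics, experimental conditions, etc.
for gsm_name, gsm in gse.gsms.items():
    print(gsm_name, gsm.metadata.get("characteristics_ch1", []))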

Big sequencing projects usually provide a data portal for downloading data and metadata. This is the case for GTEx, which has its own data portal, and The Cancer Genome Atlas (TCGA), which now uses the Harmonized Cancer Datasets/Genomic Data Commons (GDC) Data Portal to serve its data. The GDC contains the data of 39 projects (including the TCGA projects), for a total of 14,551 cases across 29 primary sites. Harmonizing and standardizing the different gene expression datasets is highly desirable, although the appropriate way of doing so is not necessarily clear. The GDC reprocesses the raw data to provide datasets processed in an identical fashion (using the same genome version, the same mapping pipeline and quantification method, etc.).
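Besides the web portal, the GDC also exposes a public REST API that can be scripted against. The small sketch below lists a few of the hosted projects; the endpoint and field names follow the GDC API documentation as I understand it, so verify them against the current docs before relying on them.

# Small sketch: query the public GDC REST API for a handful of projects.
import requests

response = requests.get(
    "https://api.gdc.cancer.gov/projects",
    params={"size": "5", "format": "JSON"},
)
response.raise_for_status()

for project in response.json()["data"]["hits"]:
    # Each hit describes one project (e.g. a TCGA cohort).
    print(project.get("project_id"), project.get("primary_site"))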

On integrating datasets

However, even with ongoing standardization efforts in place, the biggest challenge of all is knowing how to merge multiple datasets together. Datasets generated with different technologies (microarrays versus RNA-Seq) or processed with different workflows (RPKM, RSEM, k-mer counts for RNA-Seq data) are not directly comparable. In addition, it’s not uncommon to see experimental biases among samples of a given project (which used the same technology and processing approach). Knowing that, imagine the biases occurring between samples from different medical centers (using different experimental kits) that are then analyzed with different technologies and processing pipelines… The possible experimental variations are endless! While the efforts of the GDC are commendable, we probably cannot reprocess all existing raw data in the same way.
Can we still envision merging any gene expression datasets together? Apart from normalizing and correcting the known biases, what can be done to enhance dataset compatibility and exploit the vast amount of data available nowadays? I’m not sure. And I’m not even thinking about merging different kinds of data yet (perturbation experiments with sequencing and chemical screens, for example).
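As a small, concrete illustration of what “normalizing” can mean here, below is a toy quantile-normalization sketch: one classic way to force samples onto a common distribution before comparing them. It is absolutely not a full answer to cross-dataset integration (it ignores batch effects, platform differences, and biology), and the data are simulated.

# Toy illustration of quantile normalization: map every sample (column)
# onto the mean distribution across samples. Genes are rows, samples columns.
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    ranked = df.rank(method="first").astype(int) - 1        # 0-based ranks per column
    sorted_means = np.sort(df.values, axis=0).mean(axis=1)  # reference distribution
    normalized = df.copy()
    for col in df.columns:
        normalized[col] = sorted_means[ranked[col].values]
    return normalized

# Two simulated "datasets" with very different scales, merged then normalized together.
a = pd.DataFrame(np.random.lognormal(0, 1, (1000, 4)), columns=[f"a{i}" for i in range(4)])
b = pd.DataFrame(np.random.lognormal(3, 2, (1000, 4)), columns=[f"b{i}" for i in range(4)])
merged = pd.concat([a, b], axis=1)

# After normalization, every sample shares the same mean and spread.
print(quantile_normalize(merged).describe().loc[["mean", "std"]])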

What do you think?

I don’t know if complete integration is possible, feasible or even desirable for gene expression data. However, for medical images, it might be different. As for proteomics data, I don’t have a clue.

What do YOU think? I would like to know your opinion on the matter. Maybe you’ve run into a ‘big data’ situation lately. How was it for you? Please leave a comment below or contact me directly.

I’ll summarize your opinions/concerns/experiences in my next post so we can have a nice picture of how we each deal with big data in our research.