You’ve probably heard the expression “Big Data” before, particularly if you read Simon Mathien’s blog post on IRIC’s website. (If you haven’t read it yet, you should do so now!)
There are several definitions (or interpretations) of this expression, which are best summarized by the following two:
Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data
Technological domain dedicated to the analysis of very large volumes of digital data (petabytes) coming from a wide variety of sources, such as search engines and social networks; also, these large volumes of data themselves. (Official French recommendation: mégadonnées.)
The Oxford English Dictionary’s definition presents the idea of a challenging dataset, whereas the French Larousse’s definition talks about the technological domain dedicated to the analysis of very large volumes of data.
I think the Oxford Dictionary’s definition better describes the current state of big data in biomedical research, since we are not yet able to benefit from all the data available worldwide (or even from data coming from a few different medical centers).
Challenges arise at different levels when working with big data. Apart from the obvious and important technical issues related to data size and the limitations of our resources (bandwidth, memory, disk storage), big data also comes with “logistical” considerations. Those include finding datasets of interest, downloading and formatting them in the right way, retrieving and understanding their metadata, making the different datasets compatible, and integrating all datasets together.
I’m not sure how it goes in other areas of bioinformatics, but in genomics, more precisely in the analysis of gene expression, those “logistical” considerations are not trivial. I plan on looking at those other areas in the future, but for now, I’ll concentrate on gene expression.
On finding and downloading the data
Several efforts were made in order to ease data accessibility (and often to allow its analysis online). For example, the Gene Expression Omnibus (GEO), which hosts gene expression data generated from microarrays and RNA-Seq, standardizes file formats and offers metadata (which is mandatory upon deposition). Metadata (for example patient characteristics, experimental conditions, aim of the study, protocols used, data preprocessing and transformation, etc.) are important. Without this precious metadata, datasets are quite useless because you don’t know what you’re working with. ArrayExpress is another resource for data deposition.
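To give a concrete idea of what fetching such a dataset and its metadata can look like, here is a minimal sketch in Python, assuming the GEOparse package is installed and follows its documented get_GEO interface; the accession “GSE12345” is only a placeholder:

```python
# Minimal sketch: fetching a GEO series and inspecting its metadata with GEOparse.
# "GSE12345" is a placeholder accession, not a real recommendation.
import GEOparse

gse = GEOparse.get_GEO(geo="GSE12345", destdir="./geo_cache")

# Series-level metadata: title, summary, platform, submission date, etc.
for key, values in gse.metadata.items():
    print(key, ":", "; ".join(values))

# Sample-level metadata: patient characteristics, protocols, processing steps...
for gsm_name, gsm in gse.gsms.items():
    print(gsm_name, gsm.metadata.get("characteristics_ch1", []))
```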
Regarding big sequencing projects, they usually provide a data portal for downloading data and metadata. This is the case for GTEx, which has its own data portal, and The Cancer Genome Atlas (TCGA), which now uses the Harmonized Cancer Datasets/Genomic Data Commons (GDC) Data Portal to serve its data. GDC contains the data of 39 projects (including the TCGA projects) for a total of 14,551 cases in 29 primary sites. Harmonizing and standardizing the different gene expression datasets is highly desirable, although the appropriate way of doing so is not necessarily clear. GDC reprocesses the raw data to provide datasets processed in an identical fashion (using the same genome version, same mapping pipeline and quantification method, etc.).
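The GDC Data Portal is also backed by a public REST API, so this kind of exploration can be scripted. Here is a minimal sketch; the endpoint and field names are taken from the GDC API documentation as I understand it, so treat them as assumptions if the API has evolved:

```python
# Minimal sketch: listing the projects served by the GDC through its public REST API.
import requests

resp = requests.get(
    "https://api.gdc.cancer.gov/projects",
    params={"size": "100", "fields": "project_id,name,primary_site"},
)
resp.raise_for_status()

# Each "hit" describes one project (e.g. a TCGA cohort) with its primary site(s).
for hit in resp.json()["data"]["hits"]:
    print(hit["project_id"], "-", hit.get("primary_site"))
```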
On integrating datasets
However, even if some standardization efforts are ongoing, the biggest challenge of all is knowing how to merge multiple datasets together. Datasets generated with different technologies (microarrays versus RNA-Seq) or processed with different workflows (RPKM, RSEM, k-mer counts for RNA-Seq data) are not directly comparable. In addition, it’s not uncommon to see experimental biases among samples of a given project (which used the same technology and processing approach). Knowing that, imagine the biases occurring between samples from different medical centers (using different experimental kits) that are then analyzed using different technologies, processing steps… The possible experimental variations are endless! While the efforts of the GDC are commendable, we probably cannot reprocess all existing raw data in the same way.
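To illustrate the kind of ad hoc workaround one can attempt (and its limits), here is a toy sketch, not the GDC pipeline and certainly not a full solution: two fabricated expression matrices, one microarray-like and one RNA-Seq-like, are restricted to their shared genes and quantile-normalized so that the samples at least share the same distribution:

```python
# Toy sketch: merging two fabricated expression matrices (genes x samples)
# on their shared genes, then quantile-normalizing the samples.
import numpy as np
import pandas as pd

def quantile_normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Force every sample (column) onto the same empirical distribution."""
    # Mean expression at each rank, averaged across samples.
    rank_means = np.sort(df.to_numpy(), axis=0).mean(axis=1)
    # Replace each value by the mean value of its within-sample rank.
    ranks = df.rank(method="first").astype(int) - 1  # 0-based ranks
    return pd.DataFrame(rank_means[ranks.to_numpy()], index=df.index, columns=df.columns)

# Fabricated toy data standing in for real datasets.
rng = np.random.default_rng(0)
genes = [f"gene{i}" for i in range(5)]
df_microarray = pd.DataFrame(rng.uniform(4, 14, (5, 3)), index=genes,
                             columns=["array_s1", "array_s2", "array_s3"])
df_rnaseq = pd.DataFrame(rng.uniform(0, 1000, (5, 2)), index=genes,
                         columns=["rnaseq_s1", "rnaseq_s2"])

shared_genes = df_microarray.index.intersection(df_rnaseq.index)
merged = pd.concat([df_microarray.loc[shared_genes],
                    df_rnaseq.loc[shared_genes]], axis=1)
normalized = quantile_normalize(np.log2(merged + 1))
print(normalized)
```

This puts all samples on a common scale, but it does nothing about the deeper technology- or center-specific biases, which is precisely the point above.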
Can we still envision merging any gene expression datasets together? Apart from normalizing and correcting the known biases, what can be done to enhance dataset compatibility and exploit the vast amount of data available nowadays? I’m not sure. And I’m not even thinking about merging different kinds of data yet (perturbation experiments with sequencing and chemical screens, for example).
What do you think?
I don’t know if complete integration is possible, feasible or even desirable for gene expression data. However, for medical images, it might be different. As for proteomics data, I don’t have a clue.
What do YOU think? I would like to know your opinion on the matter. Maybe you’ve already experienced a ‘big data’ situation lately. How was it for you? Please, leave a comment below or contact me directly.
I’ll summarize your opinions/concerns/experiences in my next post so we can have a nice picture of how we each deal with big data in our research.
You know, I feel that what is lacking in genomic data, and that we most certainly have in images, is a “common measurement”. This is the obstacle we are hitting when trying to integrate data. With images it’s easy: even if the resolution changes, pixels remain pixels. With gene expression, we don’t have that small unit. Some people work with microarray probes, others with transcripts, some with genes and others with k-mers. And since we can’t look at a transcriptome as we do with images, I think we need to look towards common representations. In other words, finding that “pixel” for the transcriptome. Or the methylome. Or any other data. I am strongly biased towards machine learning; I feel it can help find that integrative representation.
But this is all just my thoughts/intuition on the subject.
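To make that “common representation” idea a bit more concrete, here is a toy sketch where samples from two differently sized (fabricated) datasets measuring the same genes are projected into one shared low-dimensional space. PCA stands in here for whatever model, an autoencoder for instance, would actually learn the representation:

```python
# Toy sketch of a shared low-dimensional representation for samples coming
# from two different (fabricated) gene expression datasets.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
dataset_a = rng.normal(size=(50, 200))   # e.g. 50 microarray samples x 200 genes
dataset_b = rng.normal(size=(30, 200))   # e.g. 30 RNA-Seq samples, same 200 genes

combined = np.vstack([dataset_a, dataset_b])
combined = StandardScaler().fit_transform(combined)   # put features on one scale

# Every sample, whatever its origin, is now described by the same 10 "pixels".
embedding = PCA(n_components=10).fit_transform(combined)
print(embedding.shape)  # (80, 10)
```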
In mass-spectrometry-driven proteomics, there is currently little use of the “Big data” buzzword in the literature. Mass spectrometers are indeed instruments that generate a large volume of data 24/7. Early on, as mass spectrometry instruments evolved, distributed computing became a requirement to process all this data. I feel that processing large amounts of data is not a new trend in our field of research. The “Big data” movement is, however, offering us convenient access to more computing resources (cloud) and to new generic data processing workflows (e.g. Hadoop, Spark, Dask, Docker). Right now, only a few laboratories are using these new data processing workflows. Optimized and scalable workflows were already in place for processing raw data to identify proteins. These new generic processing workflows might be handy for further dataset processing or integration. Data sharing is on the rise in proteomics. ProteomeXchange and the HUPO Proteomics Standards Initiative are the two main projects supported by the community. Data integration is also challenging in proteomics. One particular issue is missing data. Depending on the experimental conditions and instruments, proteome coverage can differ widely.
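To illustrate that missing-data point with a toy example (the numbers, the coverage threshold and the simple median imputation are purely illustrative choices):

```python
# Toy protein x run quantification table where coverage differs between runs.
import numpy as np
import pandas as pd

quant = pd.DataFrame(
    {
        "run_1": [12.1, 10.3, np.nan, 9.8],
        "run_2": [11.9, np.nan, np.nan, 9.5],
        "run_3": [12.4, 10.1, 8.7, np.nan],
    },
    index=["protA", "protB", "protC", "protD"],
)

# Keep proteins quantified in at least two of the three runs...
kept = quant[quant.notna().sum(axis=1) >= 2]

# ...and fill the remaining gaps with each protein's median across runs.
imputed = kept.apply(lambda row: row.fillna(row.median()), axis=1)
print(imputed)
```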
A new resource, the Omics Discovery Index (OmicsDI; http://www.omicsdi.org), was published this month to provide access to genomics, transcriptomics, proteomics and metabolomics datasets.