Bioinformatics in a container

A recent trend from the world of cloud computing is gaining popularity in the bioinformatics community: developing and deploying applications in containers. A container holds not only the application but also all the libraries it needs and a minimal set of operating system tools. Once built, the container is ready to run on any host computer that provides the environment required to start it.

For a developer, this is attractive because the software environment used for development and testing is identical to the one that will be used in production on a server. Since the container is isolated from the rest of the operating system, there are no more problems caused by version conflicts or updates: no more applications broken because the system administrator updated a system library.

For a user, this is just as attractive: there is no need to download and install every dependency an application requires, which sidesteps the cases where dependencies are poorly documented or no longer available.

Containers are therefore well suited to distributing and using bioinformatics software. A popular environment for developing, distributing and running containers is Docker, which deploys containers on Linux. The bioinformatics community has adopted this platform for several projects, among them Bioboxes [genomics], BioDocker [proteomics], BioShaDock and DockStore.

To get started with Docker, you can download Docker for Linux or the Docker Toolbox for Windows and Mac. The latter deploys virtual machines running Docker on your computer or in the cloud (Amazon, Google, Microsoft). Once Docker is installed, you can either build a new container image or retrieve an existing one from a public repository.
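
As a rough sketch, the two routes look like this on the command line. The helper names `fetch_image` and `build_image`, the tag `local/blast`, the directory `./blast/2.2.31` and the `DOCKER` variable are illustrative assumptions, not part of Docker or BioDocker; `biodckr/blast` is the image used later in this post.

```shell
# Sketch only: the helper names and the DOCKER variable are assumptions for illustration.
# DOCKER defaults to the real client; set DOCKER=echo to print the arguments instead of running them.

# Route 1: retrieve an existing image from a public repository (Docker Hub by default).
fetch_image() {
  ${DOCKER:-docker} pull "$1"
}

# Route 2: build an image from a directory containing a Dockerfile and tag it with a name.
build_image() {
  ${DOCKER:-docker} build -t "$1" "$2"
}

# Example usage (uncomment to run; requires a working Docker installation
# and, for the build, a hypothetical local directory holding a Dockerfile):
# fetch_image biodckr/blast
# build_image local/blast ./blast/2.2.31
```

Pulling is usually the quicker route when a maintained image already exists; building is mostly useful when you need to change the recipe.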

Here is a small Dockerfile listing the instructions needed to build a container for NCBI BLAST+ (the example comes from BioDocker: https://github.com/BioDocker/containers/blob/master/blast/2.2.31/Dockerfile):

#################################################################
# Dockerfile
#
# Version: 1
# Software: NCBI BLAST+
# Software Version: 2.2.31
# Description: basic local alignment search tool
# Website: http://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download
# Tags: Genomics|Proteomics|Transcriptomics|General
# Provides: blast 2.2.31
# Base Image: biodckr/biodocker:latest
# Build Cmd: docker build -t biodckr/blast 2.2.31/.
# Pull Cmd: docker pull biodckr/blast
# Run Cmd: docker run biodckr/blast
#################################################################

# Source Image
FROM biodckr/biodocker

################## BEGIN INSTALLATION ###########################

# install
RUN conda install blast=2.2.31

# Change workdir to /data/
WORKDIR /data/

##################### INSTALLATION END ##########################

# File Author / Maintainer
MAINTAINER Saulo Alves Aflitos
# Modified by Felipe da Veiga Leprevost 06-17-2016

FROM is the first instruction in the Dockerfile. It defines the source image on which the container is based. Here, the image is biodckr/biodocker, which is itself based on a minimal ubuntu:14.04.3 image (see here for details on how this image is built and for a more complete example of building a container). This base image contains all the dependencies shared by the BioDocker images. Such a structure is useful not only because the base can be reused, but also because updates (of the operating system, for example) are made in one place. A container is built with a filesystem organised in layers to avoid duplicating data.

Next, the RUN instruction installs BLAST in the container using the conda package manager, which is available in the biodckr/biodocker image. Finally, WORKDIR sets the working directory for subsequent execution of the application.

Since BioDocker publishes this container image in the public Docker Hub repository, we can skip the build step and simply retrieve the container as indicated in the Dockerfile:

# docker run biodckr/blast blastp -h

Unable to find image 'biodckr/blast:latest' locally
latest: Pulling from biodckr/blast
8387d9ff0016: Already exists
3b52deaaf0ed: Already exists
4bd501fad6de: Already exists
a3ed95caeb02: Already exists
af6fa2683829: Already exists
fb2d2930af28: Already exists
e712c46836f6: Already exists
c0a2096039c4: Already exists
Digest: sha256:204e2c2ad3c0d55c0a24b06444c2bb5a9a8edc2918cd5f959d0e5d3d33ba292a
Status: Downloaded newer image for biodckr/blast:latest
USAGE
blastp [-h] [-help] [-import_search_strategy filename] [-export_search_strategy filename] [-task task_name] [-db database_name] [-dbsize num_letters] [-gilist filename] [-seqidlist filename] [-negative_gilist filename] [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm] [-subject subject_input_file] [-subject_loc range] [-query input_file] [-out output_file] [-evalue evalue] [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty] [-qcov_hsp_perc float_value] [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value] [-xdrop_gap_final float_value] [-searchsp int_value] [-sum_stats bool_value] [-seg SEG_options] [-soft_masking soft_masking] [-matrix matrix_name] [-threshold float_value] [-culling_limit int_value] [-best_hit_overhang float_value] [-best_hit_score_edge float_value] [-window_size int_value] [-lcase_masking] [-query_loc range] [-parse_deflines] [-outfmt format] [-show_gis] [-num_descriptions int_value] [-num_alignments int_value] [-line_length line_length] [-html] [-max_target_seqs num_sequences] [-num_threads int_value] [-ungapped] [-remote] [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
Protein-Protein BLAST 2.2.31+

Use '-help' to print detailed descriptions of command line arguments

With a single command, Docker retrieved the application container from Docker Hub and executed the application. In the output, we first see each layer of the container filesystem: a layer is downloaded and extracted unless it is already present locally, in which case Docker reports "Already exists" and reuses it. Docker then started the container and ran blastp. Running from the command line is quite simple in this example. However, it becomes a little more complex when we want to transfer data from the host to the container.
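
One common way to handle that transfer is a bind mount: the `-v host_dir:container_dir` option of `docker run` makes a host directory visible inside the container. The sketch below mounts the current directory on `/data`, the working directory set by WORKDIR in the Dockerfile above, so that relative file names resolve inside the container. The helper name `run_in_data_dir`, the `DOCKER` variable and the FASTA file names are hypothetical and only serve the illustration.

```shell
# Sketch only: run_in_data_dir and DOCKER are illustrative names, not part of Docker.
# DOCKER defaults to the real client; set DOCKER=echo to print the arguments instead of running them.
# Usage: run_in_data_dir IMAGE COMMAND [ARGS...]
run_in_data_dir() {
  image="$1"
  shift
  # --rm removes the container once the command exits;
  # -v mounts the host's current directory on /data inside the container.
  ${DOCKER:-docker} run --rm -v "$PWD:/data" "$image" "$@"
}

# Example (hypothetical input files in the current directory):
# run_in_data_dir biodckr/blast blastp -query query.fasta -subject subject.fasta -out hits.txt
```

Files written under `/data` appear in the host directory, though they may be owned by whichever user the container runs as, so permissions can need adjusting afterwards.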

Docker is also useful for developing and deploying web services. These services are generally complex to install since they require a web server, a database, and so on. With Docker, each component can be placed in its own container and their interactions managed from there. This avoids reinstalling everything at each deployment, and the containers can be reused in different projects. Managing this kind of service is, however, more complex.
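
To give an idea of what one container per component can look like, here is a minimal sketch that starts a database and a web server on a shared user-defined network, where containers can reach each other by name. The helper name `start_stack`, the container names `db` and `web`, the placeholder password and the choice of the official `postgres` and `nginx` images are assumptions made for the example, not a recommendation for any particular bioinformatics service.

```shell
# Sketch only: start_stack, the container names and the password are placeholders.
# DOCKER defaults to the real client; set DOCKER=echo to print the arguments instead of running them.
# Usage: start_stack NETWORK_NAME
start_stack() {
  net="$1"
  # A user-defined network lets the containers resolve each other by container name.
  ${DOCKER:-docker} network create "$net"
  # Database component, started in the background (-d).
  ${DOCKER:-docker} run -d --name db --network "$net" -e POSTGRES_PASSWORD=change-me postgres
  # Web server component; host port 8080 is forwarded to port 80 in the container.
  ${DOCKER:-docker} run -d --name web --network "$net" -p 8080:80 nginx
}

# Example (requires a working Docker installation):
# start_stack bioapp-net
```

In practice, a tool such as Docker Compose is often used to declare the same set of containers in a single file rather than starting them one by one.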

For those who would like to dive into the world of containers, I recommend having solid prior experience with Linux system administration. The technology behind Docker is relatively new (2013) and evolves quickly. Many tools are available to manage this kind of environment, and it is important to invest enough time to understand them.

I hope you enjoyed this very short introduction to containers for bioinformatics. Use of this new technology is increasing, and I personally think it has a lot of potential to simplify the distribution of bioinformatics tools, whether through public repositories or simply by easing the installation process.

July 21, 2016 | Categories: Bioinformatics