Document your work by adding parameters to your shell scripts

At some point during your bioinformatics career, you’re going to start writing shell scripts, it’s kind of inevitable ! So let us discuss a strategy to add parameters to your scripts in order make them more easily reusable, while also keeping a trace of the settings you used to generate a set of results.

(Disclaimer: the procedures described below have been tested using BASH, some modifications might be necessary if you use a different shell.)

Modifying your scripts in order for them to receive a set of parameters can be done in many ways.

Numbered Shell Variables

A super simple way of doing this is using the numbered shell variables. Basically, any set of strings added to your script call become available inside the script as $1, $2, etc..

%> cat simple_script.sh
#!/bin/bash
echo "Greeting: ${1}"
echo "Target: ${2}"

%> ./simple_script.sh hello world
Greeting: hello
Target: world

(Hint: While not mandatory, wrapping variable names inside curly braces ({}) is good practice and might save you some headaches down the road.)

Useful, but not exactly what we’re shooting for. Also, argument position becomes fixed, so we’re losing a bit of flexibility.

Getopts tool

The second option people usually come across is using the built-in getopts tool.

%> cat getopts_script.sh
#!/bin/bash
while getopts g:t: option
do
  case "${option}"
  in
    g) GREETING=${OPTARG};;
    t) TARGET=${OPTARG};;
  esac
done

echo "Greeting: ${GREETING}"
echo "Target: ${TARGET}"

%> ./getopts_script.sh -g hello -t world
Greeting: hello
Target: world

The order of the parameters is no longer mandatory:

% ./getopts_script.sh -t world -g hello
Greeting: hello
Target: world

This is good, we’re getting closer to our goal !
But before we go any further, let’s take a minute to understand what is happening here.
Essentially, we are iterating over the presence of parameters in the script call (the while loop) and executing defined code blocks based on the presence of parameters (with the case statement).

Putting it all together

Let’s use what we’ve seen so far in order to achieve our goal of having a reusable script along with a log of the parameters we used for a specific execution.

Consider this very simple freebayes.sh script:

%> cat freebayes.sh
#!/bin/bash
${FREEBAYES_SOFT} -C ${COVERAGE} -f ${GENOME} ${BAM} > ${PWD}/var.${COVERAGE}X.vcf

As you can surmise, the following values are now parameters which need to be provided:
FREEBAYES_SOFT: the Freebayes executable COVERAGE: the coverage parameter of Freebayes GENOME: the path of the genome BAM: the path to the required .bam file containing our aligned NGS data
(Hint: PWD is a built-in BASH variable which references the current working directory)

Let’s add in some parameter parsing code:

%> cat freebayes.sh
#!/bin/bash
while getopts c: option
do
  case "${option}"
  in
    c) source ${OPTARG};;
  esac
done

${FREEBAYES_SOFT} -C ${COVERAGE} -f ${GENOME} ${BAM} > ${PWD}/var.${COVERAGE}X.vcf

So why are we expecting a single parameter ? The answer lies in the source statement.
Remember I suggested we write stuff in order to keep a log of the parameters we used for a specific execution ?
Well, we can now achieve our goal by specifying the values of our freebayes.sh script’s parameters inside a second file (let’s call it freebayes_run.conf) like so:

%> cat freebayes_run.conf
#!/bin/bash
FREEBAYES_SOFT=/soft/bioinfo/freebayes-1.2.0/bin/freebayes
COVERAGE=5
GENOME=/soft/bioinfo/resources/genomes/GRCh38_88/genome/GRCh38.fa
BAM=~/my_project/alignments/my_sample.bam

I can now store my reusable freebayes.sh script in some project-wide folder (such as ~/my_project/lib/) and execute it from somewhere else while providing the appropriate .conf file. A good strategy being to keep a copy the .conf file inside the analysis directory.

%> ~/my_project/lib/freebayes.sh -c ./freebayes_run.conf

With this strategy in place, if you ever need to figure out which freebayes binary or reference genome was used to create a set of results a few years later, you can just open the .conf file in the results folder to find out ! It might mean putting in a little extra effort, but I think it’s well worth it.

Let me know what you think !