Working with large files

When dealing with Next Generation Sequencing data, I am routinely asked by clients how to open sequence files. The answer is that given their huge size (often many million lines) and the consequent requirement in memory, they should probably not be opened in any way, they should only be processed.

Most software designed to work with NGS data will then process these files in a sequential fashion or stream, loading just the required amount of data from disk, processing it in memory and then outputting a result, freeing the memory used by the input data.

Historically, computers never had that much RAM to begin with and thus, in the 70s (!), a plethora of tools have evolved on UNIX to perform that kind of processing directly from the shell. These tools are now standard on Linux and MacOS. Here, I will show a few of them that have proved especially useful in dealing with large files and providing the glue between parts of NGS computation pipelines.

sed

sed appeared in 1974 as a stream editor for UNIX that would allow to perform all sorts of manipulation on text files. Its most famous use is probably as a string replacement tool although many other functions are available. For example, say you download a gene annotation file from Ensemble only to realize that chromosomes are numbered from 1 to MT, whereas your reference genome was numbered from chr1 to chrM. Two consecutive calls to sed will allow conversion from one to the other:

sed 's/^/chr/g' Homo_sapiens.GRCh37.75.gtf | sed 's/chrMT/chrM/g' > Homo_sapiens.GRCh37.75.modified.gtf

sed can also be useful to extract parts of a file. For a simple tab-delimited file where you would like to extract field x, a simple cut -f x would do the trick. But for a less structured case, for instance a GTF file that stores data as a list of attributes in no particular order, you need the power of sed and regular expressions. For example, to extract gene names from a GTF file, knowing that field 8 contains strings similar to gene_name “GAPDH”, you could use the following sed command to fetch that portion of each line:

sed 's/.*gene_name "\([^"]*\)".*;/\1/g' Homo_sapiens.GRCh37.75.gtf

And then pipe that result to two other useful shell tools, sort and uniq, so that you end up with a non-redundant list of all genes.

sed 's/.*gene_name "\([^"]*\)".*;/\1/g' Homo_sapiens.GRCh37.75.gtf | sort | uniq > gene_list.txt

awk

awk was developed a few years later, in 1977 with a similar goal in mind, that is to facilitate one liner text modification, but with a larger set of functionalities. In particular, it introduced conditions and variables. As an example, let’s say you would like to calculate the average coverage of targeted regions in an exome capture experiment, you could make use of the excellent bedtools package to first compute the depth at each position of each feature in the annotation file and pipe the output to awk to summarize coverage by adding the last field of each line ($NF) and outputting only the average:

bedtools coverage -d  -a targets.bed -b sample.bam -split \
| awk '($1 == "chr1") { sum += $NF } END { print sum/NR }'

Stream redirection

In the last example, we saved some disk space by not having to write the output of bedtools to disk. However, if we would now like to compute the number of bases that have a 20X coverage or more, we would have to spend a few minutes to recompute coverage or save the intermediate result to disk (typically a few GB). We can optimize all this by using the tee command.

This tool, as its name implies, is used to split the standard output of a program so that it can be both displayed and saved to a file. Using some additional shell stream trickery, we can send the output to two processes at once:

bedtools coverage -d -a targets.bed -b sample.bam -split \
 | tee >(awk '{ sum += $NF } END { print sum/NR }' > coverage_mean.txt) \
 | awk '$NF>=20' | wc -l > coverage_20x.txt

A last example concerns the input to programs, specifically when you would like to send some data from a process to a software that wasn’t designed to accept stdin as input or if you want to send two stdout at once to the same program. Looking at the bedtools example again, we could calculate coverage on chromosome 1 only by using the following command:

bedtools coverage -d  -a <(grep ^chr1 targets.bed) -b <(samtools view sample.bam chr1) -split

Here, we have replaced two file arguments with calls to other commands. In the background, file descriptors will be created and used as buffer to stream the data between programs but once more, nothing is written on disk.

Obviously, we have only scratched the surface of the functionalities offered by these shell commands. Many other UNIX tools exist that deserve mention here: grep, cat, join, cut, wc, etc. I encourage you to take a look at them to increase your shell scripting skills!