When working with software that reads data from disk in a random fashion, it is common knowledge that SSDs give the best performance, with SAS disks being less efficient and SATA disks the worst. However, high-capacity SSDs are still relatively expensive, so when working with large datasets one typically ends up with data stored on larger and more common SATA drives.

I recently experimented with the Jellyfish software to analyze k-mers found in a large cohort of leukemia RNA-seq samples. For this dataset of 500 samples, the k-mer hash tables represent 1.4 TB of data.

The software used was km, a tool that identifies mutations by reconstructing all possible sequence paths that start and end with the same k-mers as a small reference sequence. For simple cases, this tool will query each Jellyfish hash table for a few thousand k-mers (out of several million) for each RNA-seq sample. Given that this scenario involves many random accesses to disk, we obtain the following runtime when the hash table is on a SATA RAID 1 array:

time ./find_mutation -c 2 -p 0.02 query.fa /scratch/kmers.jf

real 1m22.442s
user 0m1.422s
sys  0m1.098s

We can dramatically improve on this by taking advantage of the one strength of SATA disks: sequential throughput. The following seems like a very bad idea, but if we first copy the whole hash table to a RAM disk and then run our query, we obtain:

time (cp /scratch/kmers-2.1.3_21.jf /dev/shm/ \
      && ./find_mutation -c 2 -p 0.02 test.fa /dev/shm/kmers-2.1.3_21.jf)

real 0m21.879s
user 0m0.916s
sys  0m4.005s

Keep in mind that in this second run we are sequentially copying about 3 GB of data to RAM, in addition to querying for the occurrence of k-mers (a few KB), and yet the whole process is nearly 4 times faster (about 22 seconds instead of 82). Investigating further, we find that the file copy takes about 18 seconds and km then runs in 3 seconds.
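The copy-then-query pattern can be wrapped in a small shell function. This is only a sketch of the idea, not something shipped with km or Jellyfish: the names RAMDISK_DIR, fits_in_dir and run_from_ramdisk are made up for illustration, and it assumes a Linux machine where /dev/shm is a RAM-backed tmpfs (typically sized at half of physical memory, hence the free-space check before copying).

```shell
#!/usr/bin/env bash
# Sketch only: RAMDISK_DIR, fits_in_dir and run_from_ramdisk are illustrative
# names, not part of km or Jellyfish.

RAMDISK_DIR="${RAMDISK_DIR:-/dev/shm}"   # tmpfs mount on most Linux systems

fits_in_dir() {
    # usage: fits_in_dir <file> <dir>
    # Succeeds when the filesystem holding <dir> has enough free space for <file>.
    local need_kb avail_kb
    need_kb=$(( ( $(wc -c < "$1") + 1023 ) / 1024 ))
    avail_kb=$(df -Pk "$2" | awk 'NR == 2 { print $4 }')
    [ "$need_kb" -le "$avail_kb" ]
}

run_from_ramdisk() {
    # usage: run_from_ramdisk <file> <command> [args...]
    # Copies <file> to the RAM disk, runs the command with the staged path
    # appended as its last argument, then removes the copy.
    local src="$1" staged status
    shift
    fits_in_dir "$src" "$RAMDISK_DIR" || {
        echo "not enough room in $RAMDISK_DIR for $src" >&2
        return 1
    }
    staged=$(mktemp -d "$RAMDISK_DIR/staged.XXXXXX") || return 1
    if cp "$src" "$staged/"; then            # one sequential read from disk
        "$@" "$staged/$(basename "$src")"    # random access now hits the staged copy
        status=$?
    else
        status=1
    fi
    rm -rf "$staged"                         # free the RAM again
    return "$status"
}
```

With this in place, the second run above would look something like `run_from_ramdisk /scratch/kmers-2.1.3_21.jf ./find_mutation -c 2 -p 0.02 test.fa`. The staged copy is removed even when the query fails, which matters here because anything left in /dev/shm keeps occupying memory.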

Large-scale test

Our friends at IBM recently lent us a FlashSystem 820 to evaluate its performance, and I took the opportunity to run some of these tests. This is the state of the art in terms of microsecond latency; our demo unit came with 8 Gbps Fibre Channel connectivity and contained twelve 1 TB flash drives. Here are the results of looking for mutations in 5 genes in 147 samples (465 GB of data).

| System | RAM disk | Time (s) | Estimated total list price* |
|---|---|---|---|
| Xeon E5-2630 with SATA storage (IBM V3700) | No | 22,858 | $64,000 |
| Xeon E5-2630 with flash storage (IBM FlashSystem 820) | No | 2,182 | $400,000 |
| My workstation (i7-3770 with RAID 1 SATA + RAM disk) | Yes | 4,515 | $3,000 |

*Estimated price of the server and storage, excluding SAN switches.
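To put these figures side by side, here is a short sketch that derives the speed and price ratios from the table. The helper name ratio is made up for illustration. Note that the workstation and the servers use different CPUs, so the ratios reflect whole systems and not storage alone.

```shell
#!/usr/bin/env bash
# Sketch: ratios derived from the results table
# (times in seconds, list prices in dollars). ratio is an illustrative helper.

ratio() {
    # usage: ratio <a> <b>; prints a / b rounded to one decimal place
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

sata_s=22858;  sata_price=64000     # Xeon + SATA storage (IBM V3700)
flash_s=2182;  flash_price=400000   # Xeon + FlashSystem 820
ram_s=4515;    ram_price=3000       # workstation, SATA RAID 1 + RAM disk

echo "FlashSystem vs SATA, speedup:      $(ratio "$sata_s" "$flash_s")x"
echo "RAM disk vs SATA, speedup:         $(ratio "$sata_s" "$ram_s")x"
echo "RAM disk vs FlashSystem, slowdown: $(ratio "$ram_s" "$flash_s")x"
echo "FlashSystem vs RAM disk, price:    $(ratio "$flash_price" "$ram_price")x"
echo "SATA server vs RAM disk, price:    $(ratio "$sata_price" "$ram_price")x"
```

With the numbers above, the FlashSystem comes out about 10.5 times faster than plain SATA storage, while the workstation with a RAM disk is about 5.1 times faster than plain SATA and about 2.1 times slower than the FlashSystem, at roughly 1/133 of its list price.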

The FlashSystem produces impressive results on its own and is quite a remarkable piece of equipment. Nevertheless, by combining the sequential throughput of SATA drives with the random access speed of RAM, we achieve pretty good times with a solution that is more than a hundred times cheaper. And for the ultimate benefit: this strategy almost eliminates the grinding noises produced by my workstation!