Recently, everyone and their mother started using various tools to optimize large data transfers to, from and between supercomputers. Historically, we have seen tools like FDT and BBCP that tried to exceed the performance of other transfer methods such as scp, rsync and ftp. One tool in particular is now gaining traction and is being deployed on most supercomputers: GridFTP and its front-end, Globus.
Before jumping on the bandwagon, I thought it would be nice to get an idea of what there is to gain by using such a service. Globus's main goals are to make data transfers faster on high-bandwidth networks and to simplify data sharing among users.
While doing this research, I read many comments claiming that using scp is bad and that one would observe up to a 25x speed increase by moving to GridFTP. I was a bit surprised that a transfer method as ancient and respected as scp would fail that badly, and the claim was far from my own experience with the tool.
After having deployed and configured Globus, which involved all sorts of technical setup, I was ready to perform some simple benchmarking. Incidentally, I have heard from many less computationally inclined people (our typical biologist collaborators) that this process is still too complex to be useful. I performed two series of tests, the first one between our institute and a local supercomputer through a 1 Gbps link and the second one between two Montreal supercomputers connected by a 10 Gbps link on the RISQ backbone (if I’m not mistaken). Here are the results obtained:
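| Source | Destination | Link | Size | scp | Globus |
|---|---|---|---|---|---|
| Workstation | briaree | 1 Gbps ethernet | 14 GB | 105 MB/s | 89 MB/s |
| briaree | guillimin | 10 Gbps ethernet | 14 GB | 152 MB/s | 149 MB/s |
| briaree | guillimin | 10 Gbps ethernet | 129 GB | NA | 159 MB/s |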
So, since scp almost reaches the theoretical limit of a 1 Gbps link (125 MB/s), there is not much room for improvement to begin with and, as we can see, transfers initiated from Globus are even slower. Maybe this is true only of our current setup, and we would see far better results on faster networks (using InfiniBand, for instance), but speed is definitely not a reason to uninstall scp just yet.
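To put the table in perspective, here is a minimal sketch of the arithmetic behind the "theoretical limit" figure. It assumes that 1 MB means 10^6 bytes, that the line rate is simply the link speed in bits per second divided by 8 (ignoring Ethernet, TCP and SSH overhead), and that the first throughput column in the table is scp and the second is Globus. The function names are made up for illustration.

```python
# Link utilization for the throughputs reported in the table.
# Assumptions: 1 MB = 10**6 bytes; protocol overhead is ignored.

def line_rate_mb_s(link_gbps: float) -> float:
    """Theoretical maximum throughput of a link, in MB/s."""
    return link_gbps * 1000 / 8


def utilization(measured_mb_s: float, link_gbps: float) -> float:
    """Fraction of the theoretical line rate actually achieved."""
    return measured_mb_s / line_rate_mb_s(link_gbps)


# (label, link speed in Gbps, measured MB/s), copied from the table
measurements = [
    ("scp,    1 Gbps,  14 GB", 1, 105),
    ("Globus, 1 Gbps,  14 GB", 1, 89),
    ("scp,    10 Gbps, 14 GB", 10, 152),
    ("Globus, 10 Gbps, 14 GB", 10, 149),
    ("Globus, 10 Gbps, 129 GB", 10, 159),
]

for label, gbps, mb_s in measurements:
    share = utilization(mb_s, gbps)
    print(f"{label}: {share:.1%} of {line_rate_mb_s(gbps):.0f} MB/s")
```

Under these assumptions, scp uses about 84% of the 1 Gbps link and Globus about 71%, while on the 10 Gbps link both sit around 12% of the line rate. The measurements alone do not say what limits the faster link.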
Globus aims to provide a “simple” way to manage data transfers and share data with collaborators. In that respect, it performs relatively well, providing the user with a simple interface to initiate transfers between servers. This is similar to using a tool like Filezilla with the exception that both endpoints can be remote servers that will communicate directly.
Globus also makes it easy to share a folder (endpoint) with another user, in a fashion similar to the sharing capabilities offered by storage services like Dropbox, Google Drive, ownCloud, etc. The main difference here is that Globus does not store any data and only acts as a broker between storage endpoints. And this model raises a big concern regarding data privacy. To make all this work, Globus requires that you hand over the credentials needed to access both endpoints so that it can act on your behalf, and thus you now need to trust this external entity to do the right thing with them. Sure, they use all sorts of reassuring jargon to promote their service, but that never changes the fact that a third party can now read any data in your user accounts.
So, no thanks. There must be a use case where this makes sense (unreliable WAN transfers), but I will stick with using scp for now.