1

We have to move a very large data set (petabytes) from an HPC cluster to a storage server. We have a high-capacity communication link between the devices. However, the bottleneck seems to be the lack of a fast transfer tool that can parallelize the transfer of individual files (the individual files are each terabytes in size).

In this regard, I am looking for a tool that does not require admin rights and is still considerably faster than scp or rsync. A tool that can be installed locally without admin rights would also be useful. I came across this link (What is the fastest way to send massive amounts of data between two computers?), which mentions a netcat-based approach, but we couldn't make it work.
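For reference, the netcat approach in that link is roughly along the lines of the sketch below (hostnames, port, and file names are placeholders, and netcat variants differ, e.g. -l -p 9000 to listen or -N to close the connection on EOF):

# On the receiving storage server: listen and write the incoming stream to disk
nc -l 9000 > bigfile

# On the HPC side: stream the file to the listener
nc storage-server 9000 < bigfile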

For context, we are trying to copy relatively few files of very large size (not many small files).

Appreciate your time and help

  • Interesting! How do you know that the bottleneck is that the transfer tool is not parallelized? Most typically, either the sequential write speed of your storage server, the read speed of your HPC machine(s), or the network/FC connection poses the bottleneck, so parallelism typically only serves to divide the limiting resource – and thus achieves no speedup. – Marcus Müller Mar 08 '23 at 16:49
  • We know the link speed, which is a few Gbps. However, rsync has a peak speed of around 40 Mbps. The idea of parallelization is mainly inspired by tools such as axel or aria2, which achieve speed by spawning multiple threads and/or processes. Both of those tools work over HTTP and FTP and are thus not helpful in our case – Ikram Ullah Mar 08 '23 at 20:00
  • Hm, but if rsync can't do better than 40 MB/s over a > 1 Gb/s link, you will have to investigate what's going wrong there. Throwing parallelism at this problem isn't going to solve it – rsync happily does 190 Mb/s over my ~200 Mb/s residential internet line, to an old laptop on wifi. If your server, on HPC cluster hardware and network, can't even fill a single Gb/s, then something is amiss that parallelism probably can't fix. Figure out what that is – unless you're really limited by a single CPU being maxed out by a single-threaded rsync on either (or both) ends, you're not gaining anything. – Marcus Müller Mar 08 '23 at 20:24
  • (the reason that some HTTP or FTP transfers are faster when done in multiple parts in parallel is that the server is throttling each individual connection. I bet that's not the case for you here. Why would a storage server throttle such low-speed transfers?) – Marcus Müller Mar 08 '23 at 20:25
  • My guess is that the cluster also does some kind of throttling, because the transfer speed between the cluster's disk drives is similar to the transfer speed over the network. – Ikram Ullah Mar 09 '23 at 07:56
  • Now it may be clearer why we want to try a multi-threaded/MPI approach, to bridge the gap as much as possible – Ikram Ullah Mar 09 '23 at 08:21
  • why would anyone throttle an HPC cluster? That makes little practical sense. Do the scientific thing: Test. Is one CPU maxed out when you transfer data? – Marcus Müller Mar 09 '23 at 20:13
  • HPC has to be quite conservative in scheduling resources (including data ingress/egress), so it has to do some kind of throttling, not the other way around (at least that's my understanding). However, parallelism can help an individual user (I am sure the HPC will have checks in place to prevent too much resource allocation for a single user, so it's not abuse from the user's perspective per se) – Ikram Ullah Mar 12 '23 at 06:06
  • Again, I'm happy to learn something new, but I don't believe your cluster is throttled that much, and I still don't see how parallelism helps a data transfer in that case. Have you asked the people running the cluster whether that's the case? – Marcus Müller Mar 12 '23 at 06:34
  • Sure, let me give it another try. I need to copy a number of very large files (4+ TB each) from the cluster to a storage device on the same network. In the base case, I can run rsync for each file, but it achieves a top speed of 35-40 Mbps. At that speed, copying one file can take days. I was thinking of using a tool that finds multiple seek points within a file and initiates a separate transfer for each segment. I know I can run one rsync per file, but there are problems with that, e.g. the connection is sometimes killed by the cluster or the destination machine – Ikram Ullah Mar 15 '23 at 12:37
  • yes, for the sixth time, there is literally zero indication that parallelisation, whether at the file or parts-of-file level, will yield any speedup, as far as I can tell. Have you done any testing to show that it will? – Marcus Müller Mar 15 '23 at 12:43
  • For testing, I need a tool that can do parallelization which is what I am looking for. – Ikram Ullah Mar 16 '23 at 14:27
  • Just manually start multiple transfers in parallel with --progress; you really don't need a tool to test the hypothesis – just check whether the speed displayed for a single transfer is significantly lower than the summed speeds of multiple manually started rsyncs! – Marcus Müller Mar 16 '23 at 21:13
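As suggested in the last comment, that hypothesis can be tested with nothing more than a few manually started transfers; a rough sketch (host, directory, and file names are placeholders):

# Baseline: one transfer, note the rate reported by --progress
rsync --progress bigfile1 fooserver:/dest-dir/

# Then run several transfers at once and compare the sum of the
# reported rates against the single-transfer baseline:
rsync --progress bigfile2 fooserver:/dest-dir/ &
rsync --progress bigfile3 fooserver:/dest-dir/ &
rsync --progress bigfile4 fooserver:/dest-dir/ &
wait

# If the summed rate is not clearly higher than the baseline,
# parallelism will not help and the bottleneck is elsewhere.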

1 Answer

2

I have used rsync + GNU parallel:

  • Identify all big(ish) files
  • Run rsync on 30-100 files in parallel

It looks like this (untested):

cd src-dir
find . -type f -size +10000 |
  parallel -j100 -X rsync -zR -Ha --inplace ./{} fooserver:/dest-dir/

To deal with broken connections, use --append for rsync and --retries for parallel:

find . -type f -size +10000 |
  parallel --retries 987 -j100 -X rsync -zR -Ha --append ./{} fooserver:/dest-dir/

We assume the files do not change at the source (otherwise --append, which trusts the data already at the destination without verifying it, could produce silently corrupt copies).

Ole Tange
  • 35,514
  • Actually there seems to be little reason why in this case you shouldn't just fire off a separate rsync for every single file. Inefficiencies for the small files will be overwhelmed by the size of the larger ones. I would add --partial, though, to allow for restarts – Chris Davies Mar 17 '23 at 13:20
  • @roaima --partial saves a local copy that has to be copied when the transfer is complete. I think --inplace will make more sense here. But the sentiment is the same: Make sure we do not start from scratch when a connection breaks. – Ole Tange Mar 17 '23 at 13:32
  • Fair point for large files. What about --partial --inplace then? It's safer than --partial --append – Chris Davies Mar 17 '23 at 13:45
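Following the last comment, a restart-friendly variant of the command above might look like this (an untested sketch using the same placeholder host and directories; with --partial --inplace an interrupted file does not have to start from scratch, without the risks of --append):

find . -type f -size +10000 |
  parallel --retries 987 -j100 -X rsync -zR -Ha --partial --inplace ./{} fooserver:/dest-dir/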