
I am currently using rsync to copy a 73GB file from a Samsung portable SSD T7 to an HPC cluster.

rsync -avh path/to/dataset me@server.somewhere:/path/to/dest

The following applies:

  1. My local machine (where my T7 is connected) is a VirtualBox VM running Ubuntu 20.
  2. The T7's transfer speed should be up to approximately 1000 MB/s.
  3. The network gives me an approximate upload speed of 7.9 Mbps.
  4. Rsync's transfer speed is probably bottlenecking this to 1-5 MB/s, according to this answer.

The problem is that the transfer is still not done after 9 hours. According to the linked answer, using cp instead is better for the first copy into an empty directory. I do not understand this, or whether it is actually true. Can someone explain?

mesllo
  • It's going over a network? You need to look at the path and find which link has the minimum speed. That 1000MB/s also represents 8Gbps. – Bib May 15 '22 at 21:05
  • @Bib, sorry I've only just begun using Linux heavily recently. What do you mean by finding which link please? – mesllo May 15 '22 at 21:14
  • I reach the HPC cluster via password-protected ssh, if that is what you asked for. With regards to what the network is, sorry I'm not sure what that means. – mesllo May 15 '22 at 21:52
  • I mean what's the physical network you use to connect from local box to remote one. Over the internet (if yes what is roughly your connection bandwidth), over local wired network (ethernet on the same building), something else? – thanasisp May 15 '22 at 22:04
  • Ah yes. So if I'm understanding correctly, I'm connected over the internet (Ziggo). Current speed tests point to an approximate upload speed of 7.9Mbps. – mesllo May 15 '22 at 22:48
  • 1
    8Mbps is around 1 MB per second. Whatever tool you use to copy 73GB (uncompressed) over this network (upload from your local box), time will be more than 73000/60/60 ~ 20 hours, It would be 1 day. Seems better rsync with -P and probably -z. – thanasisp May 15 '22 at 23:52
  • 1
    The post you linked refers a case with many files, between a hard disk and a usb device, locally. Your case is different. – thanasisp May 15 '22 at 23:54
  • Yes, I figured it should take a day because of the network bottleneck. It's still going, and now I guess I'm too far in to consider other options. I'll wait and hopefully by the end of today it will be done! Thank you. – mesllo May 16 '22 at 09:19

1 Answer


You say that,

  1. Your network speed is approximately 8 Mb/s.
  2. Your rsync runs at about 1-5 MB/s.

Given that 1 MB/s is approximately 10 Mb/s I'd say that rsync is doing you a big favour.

Me, I'd probably have added compression with -z, and since you're using rsync with sensible flags over a network connection it's probably worth interrupting it and restarting with compression. It'll just pick up where it left off.
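
A sketch of the restarted command, reusing the host and paths from the question (adjust them to match whatever you originally ran):

rsync -avhz path/to/dataset me@server.somewhere:/path/to/dest

Anything that has already arrived in full is skipped by rsync's usual size-and-modification-time check, so the re-run only sends what is still missing.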

Quick calculation: 73 GB is 73,000 MB*, which is approximately 730,000 Mb‡. You've got a network speed of 8 Mb/s, so the copy should take around 730,000/8 = 91,250 seconds, or roughly 25 hours, assuming theoretical maximum use of the network bandwidth.

* Why 1000:1 instead of 1024:1? Partly because GB:MB strictly is 1000:1, but largely because this is an approximate calculation.
‡ Why 10:1 instead of 8:1 as suggested by 8 bits to 1 byte? Two reasons: (a) it's an O(n) approximation, and (b) there's packet/protocol overhead to consider.
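
The same estimate as a quick shell one-liner, using the approximate 10:1 MB-to-Mb factor from the footnotes:

echo '73 * 1000 * 10 / 8 / 3600' | bc -l    # ≈ 25.3 hours at 8 Mb/s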


Now to try and answer the question strictly as asked, which is, "Is cp faster than rsync during the first run?". If you're using cp you need to have something managing the transport between the local and remote servers; that could be something like sshfs or NFS, or alternatively you might mean scp.

  1. cp over an NFS-mounted filesystem. With a well-tuned network this is probably fairly efficient.
  2. cp over sshfs. This will include not only the encryption overhead of ssh but also the translation between kernel and userspace for the FUSE implementation of the filesystem. Inefficient.
  3. scp (implicitly over ssh). This will include encryption overhead from ssh. Acceptable, and might further benefit from link compression (-C).
  4. rsync (implicitly over ssh, since we've not mentioned rsyncd). This will include encryption overhead from ssh. Acceptable, and might further benefit from protocol compression (-z).
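
For illustration, rough sketches of the four options, reusing the host and paths from the question; the mount point /mnt/remote and the NFS export are placeholders rather than anything the question established:

# 1. cp over an NFS mount (assumes the destination is exported over NFS)
sudo mount -t nfs server.somewhere:/path/to/dest /mnt/remote
cp -r path/to/dataset /mnt/remote/

# 2. cp over sshfs (FUSE mount over ssh, then an ordinary cp)
sshfs me@server.somewhere:/path/to/dest /mnt/remote
cp -r path/to/dataset /mnt/remote/

# 3. scp, implicitly over ssh, with link compression
scp -C -r path/to/dataset me@server.somewhere:/path/to/dest

# 4. rsync, implicitly over ssh, with protocol compression
rsync -avhz path/to/dataset me@server.somewhere:/path/to/dest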

I've not taken quantitative performance metrics; these are qualitative assessments. However, note the comments on a similar answer, "Why is scp so slow and how to make it faster?", although to be fair that one discusses comparative transfer rates of multiple files rather than one large file.

However, where rsync wins across a network connection is that it's restartable. With the right flags (e.g. --partial) it can resume even mid-way through a transfer. For a single 73 GB file taking around 25 hours to transfer, that's a massive potential advantage.
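
For example (a sketch; -P is simply shorthand for --partial --progress):

rsync -avhz --partial --progress path/to/dataset me@server.somewhere:/path/to/dest

If the transfer is interrupted, the partial file is kept at the destination, and the next run reuses its contents through rsync's normal block-checksum comparison rather than starting the file again from zero (see the comment discussion below about --append and --in-place).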

Chris Davies
  • 73 GB = 584000 Mbit --> 20h 17m at 8 Mbit/s (21h 46m if it's 73 GiB). – Kusalananda May 17 '22 at 06:04
  • @Kusalananda approximation of 1000s for G / M and 10:1 for M to m. It's an O(n) calc not a perfect one – Chris Davies May 17 '22 at 06:32
  • 1
    Just so you know, the rsync took a little just under 25 hours :) – mesllo May 17 '22 at 13:21
  • While this answer does not answer the original question as to whether cp would be faster, I'm going to accept it anyway since it was still very helpful, and the time estimation was very accurately stated. Please suggest if I should re-edit my question title so that it is more aligned with this answer since it's still useful. – mesllo May 17 '22 at 13:23
  • Perfect, thanks! – mesllo May 17 '22 at 13:25
  • 1
    @mesllo I've attempted to answer the question for you strictly as asked – Chris Davies May 17 '22 at 13:34
  • Super informative, thanks. With regards to the --partial flag, I did not specify it in my rsync. I understand that this would keep partially transferred files in the destination, but I also read that this may cause some corruption if you stop and restart the rsync later. Is this correct? Also, given that I stopped and restarted rsync a few times without this --partial flag, if there is a partially transferred file that is suddenly halted, what happens upon restarting? Will the rsync start from that file again, or skip it? – mesllo May 17 '22 at 13:41
  • @mesllo for future reference, you will have to use --append to resume, see rsync resume after interrupted. Test it locally with smaller files to see the behaviour. – thanasisp May 17 '22 at 14:58
  • @roaima I think it was ~ 20h for the data, and the rest was the encryption overhead you mention, and I guess some user's upload for other purposes. – thanasisp May 17 '22 at 15:05
  • @mesllo "this [the --partial flag] may cause some corruption if you stop and restart the rsync later. Is this correct?". No. You'll not get corruption. Just make sure you do not use --append unless you can absolutely guarantee 100% that the source file didn't change. Safer not to use it and let rsync figure it out – Chris Davies May 17 '22 at 15:11
  • @roaima After an interrupted transfer with rsync -P, a second rsync without --append or --append-verify doesn't seem to append to the partial file (for what I tested). (And of course talking about a stable source file, one-off big file transfer) – thanasisp May 17 '22 at 19:12
  • @thanasisp it does, but not in-place. First it checksums the existing blocks using its usual algorithm to ensure that what it's got matches on both sides. As it does this it starts to build a fresh copy, which it then continues to extend by copying across the network. You can force an in-place copy with the --in-place flag. You can cut out the sanity check for the already-copied file with --append, but personally I don't usually recommend this as it removes one of the safety checks that the file has copied correctly. – Chris Davies May 17 '22 at 19:26
  • Okay, so just to recap. The rsync command I ran neither did it use --append nor --partial. I interrupted this rsync process a few times. Since I didn't use --append, does it mean that my transfer didn't happen correctly? How can I verify this? At a glance, it seems that it happened just fine, but a sanity check would be nice :) – mesllo May 17 '22 at 20:44
  • 1
    Don't use --append unless you really understand exactly under what circumstances it's safe to do so. It is always safe not to use --append. – Chris Davies May 17 '22 at 20:49
  • 1
    If you want a sanity check on the file contents, md5sum will provide it – Chris Davies May 17 '22 at 20:50
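
For example (a sketch; bigfile is a placeholder for the actual file name), run one command locally and one on the cluster, then compare the two hashes:

md5sum path/to/dataset/bigfile
ssh me@server.somewhere md5sum /path/to/dest/bigfile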