
I have created a dataset of millions (>15M, so far) of images for a machine-learning project, taking up over 500 GB of storage. I created them on my MacBook Pro but want to get them to our DGX-1 (GPU cluster) somehow. I thought it would be faster to copy to a fast external SSD (2× NVMe in RAID 0), plug that drive directly into a local terminal, and copy it to the network scratch disk. I'm not so sure anymore, as I've been cp-ing to the external drive for over 24 hrs now.

I tried using the Finder GUI to copy at first (bad idea!). For a smaller dataset (2M images), I used 7zip to create a few archives. I'm now using the Terminal in macOS to copy the files using cp.

I tried cp /path/to/dataset /path/to/external-ssd.

Finder was definitely not the best approach, as it took forever at the "preparing to copy" stage.

Using 7zip to archive the dataset increased the "file" transfer speed, but it took over 4 days(!) to extract the files, and that was for a dataset an order of magnitude smaller.

Using the command-line cp started off quickly but seems to have slowed down. Activity Monitor says I'm getting 6-8k IOs on the disk. Well, maybe. iostat reports somewhere between 14-16 MB/s, at least during random spot checks. It's been 24 hours and it isn't quite halfway done.

Is there a better way to do this?

It wasn't clear to me that rsync was any better than cp for my purpose: How to copy a file from a remote server to a local machine?

SciGuy
  • Do you have ssh access to the cluster? – Iñaki Murillo May 20 '19 at 18:00
  • I do, but it would be over WiFi. Security is being tightened a lot after an embarrassing infosec audit, so I can't plug my machine into the wired ethernet anymore. – SciGuy May 20 '19 at 18:26
  • What sustained speed do you get over WiFi? (If it approximates at least half the I/O speed to the external disk, it's a win.) Do you have rsync on your Mac and on the target cluster? (rsync is better than cp/scp for situations that (may) require a restart; otherwise it's no faster and can sometimes be slower.) What filesystem type did you use for the external disk? (NTFS is horribly slow on non-Windows systems.) – Chris Davies May 20 '19 at 22:01
  • Yes, rsync is a standard install on Macs. 99% sure it's installed on the destination server. I used exFAT for my SSD for max interoperability. For regular WiFi use, I get around 2-3 MB/s. To connect to the server, I have to VPN into the institution-wide network, and then ssh in from there to the server. I'm not sure what the overhead cost is on that. I guess I was thinking of stopping the current cp process in favor of some type of block-level copy (dd?), but I only know how to do that to clone whole drives, not a specific directory/folder. – SciGuy May 20 '19 at 23:03

1 Answer

  1. Archiving your data is a good option as far as file transfer speed goes. However, if those images are mostly JPEGs, the data is already compressed, and you'll end up wasting CPU time compressing it for a final 1 or 2% gain in file size. This is why you might give tar a try, since it only packs files together without trying to compress them (unless you ask it to ;-)

  2. Another hack that may be worth trying, if your network setup allows it, is to start a web server on your laptop, then download the files from the destination host. This turns "copy from laptop to external media" + "copy from external media to destination" into a single-step process. I've done it many times (between Linux machines) and it works pretty well.
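A minimal sketch of the packing step from point 1 (the directory and file names here are placeholders; the tiny text file stands in for real images):

```shell
# Placeholder data: a directory standing in for the real dataset.
mkdir -p dataset
echo demo > dataset/img_000.jpg

# Pack without compression: -c create, -f output file.
# (Only add -z for gzip if your data isn't already compressed.)
tar -cf dataset.tar dataset

# Verify the archive lists the expected files.
tar -tf dataset.tar
```

Unpacking on the destination is just tar -xf dataset.tar, and since no (de)compression happens, it should be far faster than the multi-day 7zip extraction.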

The web-server approach is detailed here. The main steps are:

On the sender side:

  1. cd to the directory containing the files to share
  2. start a web server with Python:
    • with Python 2: python -m SimpleHTTPServer port
    • with Python 3: python -m http.server port

On the receiver side, the files will be available at http://senderIp:port. You can retrieve them easily with wget -c http://senderIp:port/yourArchiveName
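An end-to-end sketch of the sender/receiver steps, run against localhost for illustration (the port, archive name, and 127.0.0.1 address are made up; on a real transfer the receiver would use the sender's IP):

```shell
# --- Sender side (here: the same machine, for demonstration) ---
echo demo > archive.tar               # stand-in for the real tar archive
python3 -m http.server 8000 &         # serve the current directory over HTTP
SERVER_PID=$!
sleep 1                               # give the server a moment to start

# --- Receiver side ---
# -c resumes a partial download if the transfer is interrupted.
wget -c -q http://127.0.0.1:8000/archive.tar -O fetched.tar

kill "$SERVER_PID"
cmp archive.tar fetched.tar           # confirm the copy is byte-identical
```

The -c flag is what makes wget attractive here: if the WiFi or VPN drops partway through a 500 GB transfer, re-running the same command picks up where it left off instead of starting over.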

Httqm