Manual "split"
If the (local) source and the (remote) destination files are seekable then a straightforward solution is to manually calculate offsets and sizes and run two or more pipelines, each transferring one fragment of the file. Example (on the source side):
dd if=/source/file bs=10M count=1000 skip=0 iflag=fullblock \
| ssh user@server 'dd of=/destination/file bs=10M seek=0 conv=notrunc'
This will transfer the first 10000 MiB (1000 × 10M) of data. Then you use skip=1000 and seek=1000 to transfer the second 10000 MiB, then skip=2000 and seek=2000, and so on. You can run two or more such commands simultaneously.
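For instance, the command for the second chunk would look like this (a sketch with the same bs and count, offsets advanced by one chunk):

dd if=/source/file bs=10M count=1000 skip=1000 iflag=fullblock \
| ssh user@server 'dd of=/destination/file bs=10M seek=1000 conv=notrunc'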
In fact the command transferring the first part needs neither skip nor seek (0 is the default for both), and the command transferring the last part does not need count (both dd processes will terminate on EOF).
So if you want to run N ssh processes simultaneously and cover the whole /source/file, pick bs, get a calculator and compute a common count and the respective skip/seek values for each of the N almost equal parts. The last part can be somewhat smaller or somewhat bigger; simply don't specify count for it.
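If you'd rather not use a calculator, a minimal sketch like the following prints the values (assuming GNU stat; the script and its variable names are only illustrative, not part of the method itself):

#!/bin/bash
# Sketch: print dd parameters for N almost equal parts of /source/file.
src=/source/file
bs=$((10 * 1024 * 1024))              # must match the bs=10M given to dd
n=4                                   # number of parallel transfers
size=$(stat -c %s "$src")             # file size in bytes
blocks=$(( (size + bs - 1) / bs ))    # number of bs-sized blocks (last may be partial)
count=$(( blocks / n ))               # blocks per part, except the last one
for (( i = 0; i < n; i++ )); do
   if (( i < n - 1 )); then
      echo "part $i: skip=$(( i * count )) seek=$(( i * count )) count=$count"
   else
      echo "part $i: skip=$(( i * count )) seek=$(( i * count )) (no count, read to EOF)"
   fi
done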
Notes:
bs=10M may not be optimal for you. Change it and recalculate if needed.
conv=notrunc is important for every chunk but the final one.
The usage of seek= can expand the /destination/file instantaneously as a sparse file. This may lead to fragmentation. To avoid it, use fallocate on /destination/file beforehand. Note that some (old, non-*nix) filesystems don't support expanding a file without actually writing something, so seek= or fallocate may take a long time while zeros are being written.
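For example, run on the source side something like this (a sketch assuming GNU stat locally and util-linux fallocate on the server; the size is taken from the source file):

ssh user@server "fallocate -l $(stat -c %s /source/file) /destination/file"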
iflag=fullblock is not portable. In the general case the flag is important. When reading from a regular /source/file you probably don't need it, but if you can use it then use it. The point is that even a partial read counts towards count; without iflag=fullblock the local dd could reach count and stop reading too soon.
An alternative solution is like dd skip=… … | head -c … | ssh …, where there is no count=. I guess if dd read from a non-seekable stream then skip without iflag=fullblock could skip less than it should; but when reading from a regular file, skip just moves the pointer, so it should be safe. head -c is not portable though.
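A sketch of this variant for the second chunk of the earlier example (assuming a head with -c support; the byte count equals 1000 blocks of 10 MiB):

dd if=/source/file bs=10M skip=1000 \
| head -c "$((1000 * 10 * 1024 * 1024))" \
| ssh user@server 'dd of=/destination/file bs=10M seek=1000 conv=notrunc'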
GNU dd supports iflag=skip_bytes,count_bytes and oflag=seek_bytes. These can simplify your calculations greatly. E.g. the following commands will transfer the first 200 GiB of the /source/file, the following 300 GiB and the rest, respectively:
dd if=/source/file bs=10M iflag=count_bytes count=200G | ssh user@server 'dd of=/destination/file bs=10M conv=notrunc'
dd if=/source/file bs=10M iflag=skip_bytes,count_bytes skip=200G count=300G | ssh user@server 'dd of=/destination/file bs=10M conv=notrunc oflag=seek_bytes seek=200G'
dd if=/source/file bs=10M iflag=skip_bytes skip=500G | ssh user@server 'dd of=/destination/file bs=10M oflag=seek_bytes seek=500G'
Of course to solve your problem you should run these commands in parallel (e.g. in separate terminals). iflag=fullblock is not needed because the tools count bytes, not blocks. Note the /destination/file will grow to at least 500G even if the /source/file is smaller.
Disadvantage
This method reads the /source/file and writes to the /destination/file non-sequentially; it keeps jumping from one offset to another. If a mechanical drive is involved, the method may perform badly and/or strain the drive.
Alternative method
This method can transfer data from a seekable or non-seekable source to a seekable or non-seekable destination. It doesn't need to know the size of the data in advance. It reads and writes sequentially.
It's possible to create a single script that, when run on the sender, will set everything up and run the right code on the receiver. I decided to keep the solution relatively simple, so there is a piece of code to run on the sender, a piece of code to run on the receiver, plus some manual plumbing to do.
On the sending machine you will need bash, ifne, mbuffer and head with -c support. On the receiving machine you will need the same set of tools. Some requirements can be loosened; check the notes down below.
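For example, on a Debian-based system the less common tools could be installed with something like this (a sketch; bash and a head with -c support are normally already present):

sudo apt install moreutils mbuffer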
Save the following code as sender on the sending machine and make it executable:
#!/bin/bash
declare -A fd                                  # one file descriptor per fifo
mkfifo "$@" || exit 1                          # create a fifo for each argument
for f do
   exec {fd["$f"]}>"$f"                        # open for writing; blocks until a reader connects
done
while :; do
   for f do
      # feed the next 5M chunk of stdin to the next fifo, round-robin;
      # ifne -n false fails on empty input (EOF), which ends both loops
      head -c 5M | ifne -n false >&"${fd["$f"]}" || break 2
   done
done
rm "$@"                                        # remove the fifos when done
Save the following code as receiver on the receiving machine and make it executable:
#!/bin/bash
declare -A fd                                  # one file descriptor per fifo
mkfifo "$@" || exit 1                          # create a fifo for each argument
for f do
   exec {fd["$f"]}<"$f"                        # open for reading; blocks until a writer connects
done
while :; do
   for f do
      # copy the next 5M chunk from the next fifo to stdout, round-robin;
      # ifne -n false fails on empty input (EOF), which ends both loops
      <&"${fd["$f"]}" head -c 5M | ifne -n false || break 2
   done
done
rm "$@"                                        # remove the fifos when done
The scripts are very similar.
On the sending machine run the sender with an arbitrary number of arguments:
</source/file ./sender /path/to/fifoS1 /path/to/fifoS2 /path/to/fifoS3
This will create fifos fifoS1, fifoS2, fifoS3. Use fewer or more fifos, it's up to you. Paths to fifos may be relative. The sender will wait for something to read data from the fifos.
On the receiving machine run the receiver with the same number of arguments:
>/destination/file ./receiver /location/of/fifoR1 /location/of/fifoR2 /location/of/fifoR3
This will create fifos fifoR1, fifoR2, fifoR3. Paths to fifos may be relative. The receiver will wait for something to write data to the fifos.
On the sending machine connect each local fifo to the respective remote fifo via mbuffer, ssh and remote mbuffer:
</path/to/fifoS1 mbuffer -m 10M | ssh user@server 'mbuffer -q -m 10M >/location/of/fifoR1'
</path/to/fifoS2 mbuffer -m 10M | ssh user@server 'mbuffer -q -m 10M >/location/of/fifoR2'
# and so on
Run these commands in parallel. The data will start flowing after you connect all the local fifos to their remote counterparts. In case it's not obvious: the Nth argument provided to receiver is the counterpart of the Nth argument provided to sender.
Notes:
sender can read from a pipe; receiver can write to a pipe.
When transferring a seekable source to a seekable destination, you can resume an interrupted transfer by piping dd skip=… … to sender and by piping receiver to dd seek=… …. Remember the limitations of dd and the abilities of GNU dd discussed in the previous section of this answer.
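For example, if the first 100G (a hypothetical figure) is already known to be intact on the destination, a resumed transfer could be sketched like this, assuming GNU dd:

# on the sending machine
dd if=/source/file bs=10M iflag=skip_bytes skip=100G | ./sender /path/to/fifoS1 /path/to/fifoS2
# on the receiving machine
./receiver /location/of/fifoR1 /location/of/fifoR2 | dd of=/destination/file bs=10M conv=notrunc oflag=seek_bytes seek=100G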
-c for head in the scripts is like -b for split. Chunks of this size are sent to the fifos in a round-robin fashion. If you decide to change -c 5M, use identical values in both scripts; this is crucial. Also adjust -m for the mbuffers accordingly.
The mbuffers and the size of their buffers are important for performance.
- If buffers are too small on the sending side then sender will slow down, waiting for one slowly encrypting ssh, while the other sshs idly wait for the sender to move on to them. I think for each mbuffer on the sending side you should use at least the size of one chunk. In the example I'm using twice as much (-m 10M).
- If buffers are too small on the receiving side and receiver manages to deplete one of the buffers before it moves to the next pipe, then it may happen that the other sshs stall, because the mbuffers they write to get full. I think for each mbuffer on the receiving side you should use at least the size of one chunk. In the example I'm using twice as much (-m 10M).
bash is needed to handle file descriptors when the number of fifos is not known in advance. Porting the scripts to pure sh should be relatively easy in the case of a fixed number of fifos.
head can be replaced by a dd with iflag=fullblock support.
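For example, each head -c 5M in the scripts could become something along these lines (a sketch; the 2>/dev/null only silences dd's transfer statistics):

dd bs=5M count=1 iflag=fullblock 2>/dev/null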
The mbuffer+ssh pipelines should terminate automatically after the sender or the receiver is terminated. You won't need to terminate these pipelines by hand.
ifne is used to detect EOF. You can remove each ifne from the sender, but then you will need to manually terminate the script and remove the fifos after their job is done. A good way to know when the job is done is preceding ./sender with (cat; echo "done" >&2):
(</source/file cat; echo "done" >&2) | ./sender …
When you see done, wait for each mbuffer to get empty and idle. Then you can terminate the sender, e.g. with Ctrl+C.
Similarly you can remove each ifne from the receiver and manually terminate it when you are reasonably sure it's done. Note that if receiver pipes to some other command then in general you cannot easily tell when it's done. And even if you think you can tell, Ctrl+C may not be the best idea because it will send SIGINT to the other command as well.
It's so much easier with ifne. In Debian ifne is in the moreutils package.