1

Does a streaming version of split exist somewhere in Linux?

I'm trying to back up a large amount of data via SSH, but SSH's single-threaded encryption is limiting the transfer. This machine doesn't have hardware AES support, so I'm using ChaCha encryption, but the CPU is still not keeping up with the network.

So I thought I could solve this by splitting the data stream in two, sending each half over a separate SSH connection, and then merging the streams back together at the destination. That way the encryption load can be shared over multiple CPU cores. This seemed like a general enough idea that it should already exist, but I can't find it.

edit: for some numbers, I'm backing up data from an old computer, a few hundred GB over a gigabit wired network. I'm copying an image of a partition, as that is faster than individual file access on the spinning-rust drive, so technically it is random-access data, but it is too large to treat it as such. I tried compressing it, but that doesn't help a lot; the data isn't very compressible.

So what I'm looking for is a split (and a corresponding merge) that will split a stream of binary data into multiple streams (probably in fixed-size chunks).

JanKanis
  • 1,069
  • Can you compress the data as you send it? Also, does "large" have a number? What's the system you're fetching the data from (how programmable is it)? What's your connection to it? WiFi or wired? – waltinator Mar 28 '21 at 17:09
  • @KamilMaciorowski I had a look at --filter, but with streaming (where it doesn't know the input size) it only works on textual data and splits line by line, with -n r/2. – JanKanis Mar 28 '21 at 18:26
  • @waltinator I added some numbers to the question – JanKanis Mar 28 '21 at 18:33
  • I doubt that copying a whole partition, filesystem metadata, free space and all (including errors) is faster than using rsync to traverse the filesystems you want to save. Read man rsync. – waltinator Mar 29 '21 at 00:52
  • @waltinator That depends on how full the filesystem is. This one was quite full. I tried sending files using tar before trying to image the whole partition, but that sometimes slowed down to mere kilobytes per second when recursing in directories with many small files (mail folders). I guess we're all spoilt nowadays with SSDs, and we're forgetting how slow random access on magnetic hdds can be. – JanKanis Mar 29 '21 at 08:19
  • frame challenge: if you can do this on a local network (where network sniffing is not your main threat), you could do without SSH and just dump the data over netcat. – ilkkachu Mar 29 '21 at 14:25
  • @ilkkachu That's also an option, but I prefer to be safe and always use encryption. And this question is also for posterity, a streaming split could be handy in other situations. – JanKanis Mar 29 '21 at 14:33
  • In principle, a few lines of Perl could probably stripe the data and recombine it at the other end. But I wonder how much the extra nuisance and testing is worth. You didn't show numbers, but I got something like 85 MB/s over a GigE network with chacha20-poly1305@openssh.com when reading from /dev/zero, on a 10+ year old Core 2 CPU. It could be better, but doesn't seem horrible. (I'm not sure how much the overhead from SSH and the lower layers would be, exactly, or what the theoretical upper limit is.) – ilkkachu Mar 29 '21 at 14:38
  • @JanKanis, yes, sure. Just something that came to mind. – ilkkachu Mar 29 '21 at 14:38
  • @ilkkachu I considered writing something myself, but that wasn't worth it for just this particular problem. I might still do so as it would be a useful generic tool. – JanKanis Mar 29 '21 at 14:50

4 Answers

1

Manual "split"

If the (local) source and the (remote) destination files are seekable then a straightforward solution is to manually calculate offsets and sizes and run two or more pipelines, each transferring one fragment of the file. Example (on the source side):

dd if=/source/file bs=10M count=1000 skip=0 iflag=fullblock \
| ssh user@server 'dd of=/destination/file bs=10M seek=0 conv=notrunc'

This will transfer the first 1000 MiB (1000x10M) of data. Then you use skip=1000 and seek=1000 to transfer the second 1000 MiB. Then skip=2000 and seek=2000. And so on. You can run two or more such commands simultaneously.
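
For instance, a rough sketch of the second and third parts running at the same time from one shell (same values as above; separate terminals work just as well):

    dd if=/source/file bs=10M count=1000 skip=1000 iflag=fullblock \
    | ssh user@server 'dd of=/destination/file bs=10M seek=1000 conv=notrunc' &
    dd if=/source/file bs=10M count=1000 skip=2000 iflag=fullblock \
    | ssh user@server 'dd of=/destination/file bs=10M seek=2000 conv=notrunc' &
    wait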

In fact the command transferring the first part does not need skip nor seek (because 0 is the default for both); and the command transferring the last part does not need count (because both dd will terminate on EOF).

So if you want to run N ssh processes simultaneously and cover the whole /source/file then pick bs, get a calculator and compute a common count and respective skip/seek values for each of N almost equal parts. The last part can be somewhat smaller or somewhat bigger, simply don't specify the count for it.
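
As a sketch of that arithmetic (assuming GNU stat and a regular source file; the variable names are just for illustration):

    bs=$((10*1024*1024))                 # the chosen block size, 10 MiB here
    size=$(stat -c %s /source/file)      # source size in bytes
    n=2                                  # number of parallel ssh connections
    blocks=$(( (size + bs - 1) / bs ))   # total number of bs-sized blocks, rounded up
    count=$(( blocks / n ))              # blocks per part; omit count= for the last part
    # part i (i = 0 .. n-1) then uses skip=$((i * count)) and seek=$((i * count))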

Notes:

  • bs=10M may not be optimal for you. Change it and recalculate if needed.

  • conv=notrunc is important for every chunk but the final one.

  • The usage of seek= can expand the /destination/file instantaneously as sparse. This may lead to fragmentation. To avoid it, use fallocate on /destination/file beforehand. Note some (old, non-*nix) filesystems don't support expanding a file without actually writing something, so seek= or fallocate may take a long time when zeros are being written.

  • iflag=fullblock is not portable. In the general case the flag is important. When reading from a regular /source/file you probably don't need it, but if you can use it then use it. The point is that even a partial read counts towards count; without iflag=fullblock the local dd could reach count and stop reading too soon.

    An alternative solution is something like dd skip=… … | head -c … | ssh …, where there is no count=. I guess if dd read from a non-seekable stream then skip without iflag=fullblock could skip less than it should. But when reading from a regular file skip just moves the pointer, so it should be safe. head -c is not portable though. (A sketch of this variant follows after these notes.)

  • GNU dd supports iflag=skip_bytes,count_bytes and oflag=seek_bytes. These can simplify your calculations greatly. E.g. the following commands will transfer the first 200 GiB of the /source/file, the following 300 GiB and the rest, respectively:

    dd if=/source/file bs=10M iflag=count_bytes                      count=200G | ssh user@server 'dd of=/destination/file bs=10M conv=notrunc'
    dd if=/source/file bs=10M iflag=skip_bytes,count_bytes skip=200G count=300G | ssh user@server 'dd of=/destination/file bs=10M conv=notrunc oflag=seek_bytes seek=200G'
    dd if=/source/file bs=10M iflag=skip_bytes             skip=500G            | ssh user@server 'dd of=/destination/file bs=10M              oflag=seek_bytes seek=500G'
    

    Of course to solve your problem you should run these commands in parallel (e.g. in separate terminals). iflag=fullblock is not needed because the tools count bytes, not blocks. Note the /destination/file will grow to at least 500G even if the /source/file is smaller.
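
For completeness, here is a sketch of the dd | head -c variant mentioned in the notes (transferring the second 1000x10M part of the first example; assumes GNU head):

    dd if=/source/file bs=10M skip=1000 \
    | head -c "$((1000 * 10 * 1024 * 1024))" \
    | ssh user@server 'dd of=/destination/file bs=10M seek=1000 conv=notrunc'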

Disadvantage

This method reads the /source/file and writes to the /destination/file non-sequentially; it keeps jumping from one offset to another. If a mechanical drive is involved, the method may perform badly and/or strain the drive.


Alternative method

This method can transfer data from a seekable or non-seekable source to a seekable or non-seekable destination. It doesn't need to know the size of the data in advance. It reads and writes sequentially.

It's possible to create a single script that, when run on the sender, will set everything up and run the right code on the receiver. I decided to keep the solution relatively simple, so there is a piece of code to run on the sender, a piece of code to run on the receiver, plus some manual plumbing to do.

On the sending machine you will need bash, ifne, mbuffer and head with -c support. On the receiving machine you will need the same set of tools. Some requirements can be loosened; check the notes below.
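
On a Debian-based system, for example, the extra tools can presumably be installed with something like this (ifne is in the moreutils package, as noted below; mbuffer is assumed to be packaged under its own name):

    sudo apt install moreutils mbuffer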

Save the following code as sender on the sending machine, make it executable:

#!/bin/bash

declare -A fd
mkfifo "$@" || exit 1

for f do
   exec {fd["$f"]}>"$f"
done

while :; do
   for f do
      head -c 5M | ifne -n false >&"${fd["$f"]}" || break 2
   done
done

rm "$@"

Save the following code as receiver on the receiving machine, make it executable:

#!/bin/bash

declare -A fd
mkfifo "$@" || exit 1

for f do
   exec {fd["$f"]}<"$f"
done

while :; do
   for f do
      <&"${fd["$f"]}" head -c 5M | ifne -n false || break 2
   done
done

rm "$@"

The scripts are very similar.

On the sending machine run the sender with an arbitrary number of arguments:

</source/file ./sender /path/to/fifoS1 /path/to/fifoS2 /path/to/fifoS3

This will create fifos fifoS1, fifoS2, fifoS3. Use fewer or more fifos, it's up to you. Paths to fifos may be relative. The sender will wait for something to read data from the fifos.

On the receiving machine run the receiver with the same number of arguments:

>/destination/file ./receiver /location/of/fifoR1 /location/of/fifoR2 /location/of/fifoR3

This will create fifos fifoR1, fifoR2, fifoR3. Paths to fifos may be relative. The receiver will wait for something to write data to the fifos.

On the sending machine connect each local fifo to the respective remote fifo via mbuffer, ssh and remote mbuffer:

</path/to/fifoS1 mbuffer -m 10M | ssh user@server 'mbuffer -q -m 10M >/location/of/fifoR1'
</path/to/fifoS2 mbuffer -m 10M | ssh user@server 'mbuffer -q -m 10M >/location/of/fifoR2'
# and so on

Run these commands in parallel. The data will start flowing after you connect all the local fifos to their remote counterparts. In case it's not obvious: Nth argument provided to receiver is the counterpart of the Nth argument provided to sender.
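
If you would rather not keep several terminals open, a rough sketch (assuming the three example fifos) is to background the pipelines from a loop:

    for i in 1 2 3; do
        </path/to/fifoS$i mbuffer -m 10M \
        | ssh user@server "mbuffer -q -m 10M >/location/of/fifoR$i" &
    done
    wait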

Notes:

  • sender can read from a pipe; receiver can write to a pipe.

  • When transferring a seekable source to a seekable destination, you can resume an interrupted transfer by piping dd skip=… … to sender and by piping receiver to dd seek=… …. Remember the limitations of dd and the abilities of GNU dd discussed in the previous section of this answer. (A sketch of this follows after these notes.)

  • -c for head in the scripts is like -b for split. Chunks of this size are sent to fifos in a round-robin fashion. If you decide to change -c 5M, use identical values in both scripts; this is crucial. Also adjust -m for the mbuffers accordingly.

  • mbuffers and the size of their buffers are important for performance.

    • If buffers are too small on the sending side then sender will slow down, waiting for one slowly encrypting ssh, while other sshs will idly wait for the sender to move on to them. I think for each mbuffer on the sending side you should use at least the size of one chunk. In the example I'm using twice as much (-m 10M).
    • If buffers are too small on the receiving side and receiver manages to deplete one of the buffers before it moves to the next pipe, then it may happen that other sshs will stall, because mbuffers they write to will get full. I think for each mbuffer on the receiving side you should use at least the size of one chunk. In the example I'm using twice as much (-m 10M).
  • bash is needed to handle file descriptors, when the number of fifos is not known in advance. Porting the scripts to pure sh should be relatively easy in case of a fixed number of fifos.

  • head can be replaced by dd with iflag=fullblock support.

  • The mbuffer+ssh pipelines should terminate automatically after the sender or the receiver is terminated. You won't need to terminate these pipelines by hand.

  • ifne is used to detect EOF. You can remove each ifne from the sender, but then you will need to manually terminate the script and remove the fifos after their job is done. A good way to know when the job is done is to precede ./sender with (cat; echo "done" >&2):

    (</source/file cat; echo "done" >&2) | ./sender …
    

    When you see done, wait for each mbuffer to get empty and idle. Then you can terminate the sender, e.g. with Ctrl+C.

    Similarly you can remove each ifne from the receiver and manually terminate it when you are reasonably sure it's done. Note if receiver pipes to some other command then in general you cannot easily tell it's done. And even if you think you can tell, then Ctrl+C may not be the best idea because it will send SIGINT to the other command as well.

    It's so much easier with ifne. In Debian ifne is in the moreutils package.
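
As a sketch of the resume idea mentioned in the notes above (the 100G offset is illustrative and must refer to the same byte position on both sides; GNU dd byte flags assumed), on the sending machine:

    dd if=/source/file bs=10M iflag=skip_bytes skip=100G \
    | ./sender /path/to/fifoS1 /path/to/fifoS2

and on the receiving machine:

    ./receiver /location/of/fifoR1 /location/of/fifoR2 \
    | dd of=/destination/file bs=10M oflag=seek_bytes seek=100G conv=notrunc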

1

Something like this:

parallel --block -1 --pipepart --recend '' -a bigfile 'ssh dst cat \>{#}'

When it is done, cat the files together:

ssh dst cat $(seq $(parallel --number-of-threads)) '>' bigfile.bak

You need to have room for 2x bigfile.
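
If you want to pick the number of connections yourself (say two, to match the question), a hedged variant relying on GNU parallel's -j option could look like this:

    parallel -j2 --block -1 --pipepart --recend '' -a bigfile 'ssh dst cat \>{#}'
    ssh dst cat 1 2 '>' bigfile.bak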

Ole Tange
  • 35,514
1

Indeed, GNU parallel can do most of the boilerplate work that would otherwise be needed to tackle this problem, as it can read a seekable file directly in chunks and feed each chunk to a concurrent job. However, especially when operating on a spinning HDD, I would deem it advisable to instruct parallel to manage the file in fixed --block size chunks rather than as a whole. Each of these jobs then handles the receiver side simply by delivering its specific chunk remotely, through a dd seek= + copy command, directly onto the destination file.

Here's a fully functional script that accomplishes everything:

#!/bin/sh --

# For improved performance, and even in presence of pubkey authentication,
# it is advisable to use ssh master mode, so that each job spawned by parallel
# does not undergo the entire authentication handshake. Otherwise a
# several-hundred-GBs file, even split in chunks of 100MB each, would
# force the whole operation to spend several minutes as a total just on ssh
# authentication handshakes.
trap 'for c in c?; do ssh -S "$c" -O exit .; done' EXIT
for c in $(seq 1 $(parallel --number-of-cores)); do
    ssh -fnNMS "c$c" -o ControlPersist=yes "$@" || exit
done

# let's say 10 megabytes each chunk
csize="$((10*1024*1024))"

parallel --pipepart -a /dev/sr0 --recend '' --block "$csize" '
    ssh -S c{%} localhost "{ dd bs='"$csize"' seek="$(({#} - 1))" count=1 iflag=fullblock conv=sparse,notrunc # && \
#        head -c '"$csize"'
    } 1<>test 2>/dev/null"
'

Note this assumes GNU dd on the destination side, which supports the iflag=fullblock operation mode. Lacking that, one would use dd only as a seeking-capable command, hence employing a count=0 option in place of the count=1, operating over the same file-descriptor later used by a head -c command as in the commented-out line.
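
A rough sketch of that fallback for the remote side (N stands for the chunk number, i.e. what {#} provides in the parallel template above; test is the destination file as in the script):

    {
        dd bs="$csize" seek="$((N - 1))" count=0   # only seeks the shared output fd, copies nothing
        head -c "$csize"                           # then writes the chunk at that offset
    } 1<>test 2>/dev/null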

Note also that conv=sparse option is not POSIX, though it is supported by both GNU and BSD dd. In case your dd does not support that option then just leave it out and dd will simply operate somewhat slower (YMMV).

Note that the commented-out lines (though not the comment-only lines) are better stripped off the actual script as some shells do not support embedded comments.

You would run the above script on the sender machine as in:

$ sh script /dev/sda user@receiver-machine

The idea of the operation is to use a chunk size big enough to saturate the HDD speed without straining its heads, while the parallel jobs keep reading disk sectors (assuming you read a /dev/sdaX directly) adjacent enough to benefit from the kernel's own block-device cache. This would be impossible if we let parallel handle the entire several-hundred-GB file at once, because its concurrent jobs would request seek-and-read operations far away from each other, forcing the disk to move its heads back and forth while also continuously invalidating the kernel's block-device cache (thus possibly also swapping in and out). I'm of course also assuming that your old machine does not have several hundred GB of RAM.


FWIW, note that you could also use the regular split command for this case, and it would be perfectly safe; it just would not be as performant. In a few quick tests, split actually turned out even slower than an alternative "boilerplate" shell-only script I threw together for comparison, let alone the solution using GNU parallel.

Anyway, just for the record, a natural companion tool for split in cases like these is paste, which essentially does the round-robin reading that matches the round-robin writing performed by a split -n r/X.

Here's a practical example of how it could look for a 4-way split transmission:

#!/bin/bash --

sfile="${1:?}"; shift

# Side note: uncomment the following lines if you don't have ssh-keys to authenticate
# to the sender machine and thus you need to enter passwords. Note that this requires
# master mode enabled and allowed by the ssh client, and connection multiplexing enabled
# and allowed by the ssh server. Both are typically enabled by default in OpenSSH.
#for cnum in {1..4}; do
#    ssh -fnNMS "c$cnum" -o ControlPersist=yes "$@" || exit
#done
#trap 'for cnum in {1..4}; do ssh -S "c$cnum" -O exit -; done' EXIT

LC_CTYPE=C paste -d '\n' \
    <(ssh -S c1 "$@" "LC_ALL=C split -n r/1/4 $sfile") \
    <(ssh -S c2 "$@" "LC_ALL=C split -n r/2/4 $sfile") \
    <(ssh -S c3 "$@" "LC_ALL=C split -n r/3/4 $sfile") \
    <(ssh -S c4 "$@" "LC_ALL=C split -n r/4/4 $sfile")

Made as a script, you would run the above on the receiver machine as in:

$ bash script /dev/sda user@sender-machine > file

After that, it is highly likely that you will need to adjust the size of the file created on the receiver machine (a detailed explanation as to why this is often required can be provided if you're interested). You can adjust it with a simple:

$ truncate -s <real-size> file
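
One way to obtain <real-size> in bytes, assuming the source is a block device and util-linux's blockdev is available on the sender machine (it may need root):

    $ blockdev --getsize64 /dev/sda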

That's all.

One (mostly theoretical) caveat: because split -n r/.. by its nature splits on newlines, if the input data contains no newlines at all then there is no splitting at all, and the whole amount of data gets transferred over a single one of the four connections in the example.
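
If you want to check whether this caveat applies to your data, one quick probe (GNU head assumed) is to count the newlines in a sample of the source; a result near zero means the round-robin split would not balance the streams:

    $ head -c 100M /dev/sda | tr -cd '\n' | wc -c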

HTH

LL3
  • 5,418
0

As I needed a problem to learn Rust, I decided to write a program that solves this myself. It is available at https://github.com/JanKanis/streamsplit. Example:

streamsplit -i ./mybigfile split 2 'ssh remoteserver streamsplit merge -o ./destinationfile'
JanKanis
  • 1,069