9

The files 1..64 are each 160 MBytes and stored in a RAM disk.

Generated by:

seq 120 | parallel -k 'seq {}0000000 {}9999999 | fmt -30' | head -c 10G > 10G
parallel --pipepart --block -1 -a 10G -j 64 'cat > {#}'

nocat:

#!/bin/bash

export LC_ALL=C sort -m
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 1; sort ) < 1)
<((rm 2; sort ) < 2) )
<(sort -m
<((rm 3; sort ) < 3)
<((rm 4; sort ) < 4) ) )
<(sort -m
<(sort -m
<((rm 5; sort ) < 5)
<((rm 6; sort ) < 6) )
<(sort -m
<((rm 7; sort ) < 7)
<((rm 8; sort ) < 8) ) ) )
<(sort -m
<(sort -m
<(sort -m
<((rm 9; sort ) < 9)
<((rm 10; sort ) < 10) )
<(sort -m
<((rm 11; sort ) < 11)
<((rm 12; sort ) < 12) ) )
<(sort -m
<(sort -m
<((rm 13; sort ) < 13)
<((rm 14; sort ) < 14) )
<(sort -m
<((rm 15; sort ) < 15)
<((rm 16; sort ) < 16) ) ) ) )
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 17; sort ) < 17)
<((rm 18; sort ) < 18) )
<(sort -m
<((rm 19; sort ) < 19)
<((rm 20; sort ) < 20) ) )
<(sort -m
<(sort -m
<((rm 21; sort ) < 21)
<((rm 22; sort ) < 22) )
<(sort -m
<((rm 23; sort ) < 23)
<((rm 24; sort ) < 24) ) ) )
<(sort -m
<(sort -m
<(sort -m
<((rm 25; sort ) < 25)
<((rm 26; sort ) < 26) )
<(sort -m
<((rm 27; sort ) < 27)
<((rm 28; sort ) < 28) ) )
<(sort -m
<(sort -m
<((rm 29; sort ) < 29)
<((rm 30; sort ) < 30) )
<(sort -m
<((rm 31; sort ) < 31)
<((rm 32; sort ) < 32) ) ) ) ) )
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 33; sort ) < 33)
<((rm 34; sort ) < 34) )
<(sort -m
<((rm 35; sort ) < 35)
<((rm 36; sort ) < 36) ) )
<(sort -m
<(sort -m
<((rm 37; sort ) < 37)
<((rm 38; sort ) < 38) )
<(sort -m
<((rm 39; sort ) < 39)
<((rm 40; sort ) < 40) ) ) )
<(sort -m
<(sort -m
<(sort -m
<((rm 41; sort ) < 41)
<((rm 42; sort ) < 42) )
<(sort -m
<((rm 43; sort ) < 43)
<((rm 44; sort ) < 44) ) )
<(sort -m
<(sort -m
<((rm 45; sort ) < 45)
<((rm 46; sort ) < 46) )
<(sort -m
<((rm 47; sort ) < 47)
<((rm 48; sort ) < 48) ) ) ) )
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 49; sort ) < 49)
<((rm 50; sort ) < 50) )
<(sort -m
<((rm 51; sort ) < 51)
<((rm 52; sort ) < 52) ) )
<(sort -m
<(sort -m
<((rm 53; sort ) < 53)
<((rm 54; sort ) < 54) )
<(sort -m
<((rm 55; sort ) < 55)
<((rm 56; sort ) < 56) ) ) )
<(sort -m
<(sort -m
<(sort -m
<((rm 57; sort ) < 57)
<((rm 58; sort ) < 58) )
<(sort -m
<((rm 59; sort ) < 59)
<((rm 60; sort ) < 60) ) )
<(sort -m
<(sort -m
<((rm 61; sort ) < 61)
<((rm 62; sort ) < 62) )
<(sort -m
<((rm 63; sort ) < 63)
<((rm 64; sort ) < 64) ) ) ) ) ) | md5sum

withcat:

#!/bin/bash

export LC_ALL=C sort -m
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 1; sort ) < 1)
<((rm 2; sort ) < 2) | cat)
<(sort -m
<((rm 3; sort ) < 3)
<((rm 4; sort ) < 4) | cat) | cat)
<(sort -m
<(sort -m
<((rm 5; sort ) < 5)
<((rm 6; sort ) < 6) | cat)
<(sort -m
<((rm 7; sort ) < 7)
<((rm 8; sort ) < 8) | cat) | cat) | cat)
<(sort -m
<(sort -m
<(sort -m
<((rm 9; sort ) < 9)
<((rm 10; sort ) < 10) | cat)
<(sort -m
<((rm 11; sort ) < 11)
<((rm 12; sort ) < 12) | cat) | cat)
<(sort -m
<(sort -m
<((rm 13; sort ) < 13)
<((rm 14; sort ) < 14) | cat)
<(sort -m
<((rm 15; sort ) < 15)
<((rm 16; sort ) < 16) | cat) | cat) | cat) | cat)
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 17; sort ) < 17)
<((rm 18; sort ) < 18) | cat)
<(sort -m
<((rm 19; sort ) < 19)
<((rm 20; sort ) < 20) | cat) | cat)
<(sort -m
<(sort -m
<((rm 21; sort ) < 21)
<((rm 22; sort ) < 22) | cat)
<(sort -m
<((rm 23; sort ) < 23)
<((rm 24; sort ) < 24) | cat) | cat) | cat)
<(sort -m
<(sort -m
<(sort -m
<((rm 25; sort ) < 25)
<((rm 26; sort ) < 26) | cat)
<(sort -m
<((rm 27; sort ) < 27)
<((rm 28; sort ) < 28) | cat) | cat)
<(sort -m
<(sort -m
<((rm 29; sort ) < 29)
<((rm 30; sort ) < 30) | cat)
<(sort -m
<((rm 31; sort ) < 31)
<((rm 32; sort ) < 32) | cat) | cat) | cat) | cat) | cat)
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 33; sort ) < 33)
<((rm 34; sort ) < 34) | cat)
<(sort -m
<((rm 35; sort ) < 35)
<((rm 36; sort ) < 36) | cat) | cat)
<(sort -m
<(sort -m
<((rm 37; sort ) < 37)
<((rm 38; sort ) < 38) | cat)
<(sort -m
<((rm 39; sort ) < 39)
<((rm 40; sort ) < 40) | cat) | cat) | cat)
<(sort -m
<(sort -m
<(sort -m
<((rm 41; sort ) < 41)
<((rm 42; sort ) < 42) | cat)
<(sort -m
<((rm 43; sort ) < 43)
<((rm 44; sort ) < 44) | cat) | cat)
<(sort -m
<(sort -m
<((rm 45; sort ) < 45)
<((rm 46; sort ) < 46) | cat)
<(sort -m
<((rm 47; sort ) < 47)
<((rm 48; sort ) < 48) | cat) | cat) | cat) | cat)
<(sort -m
<(sort -m
<(sort -m
<(sort -m
<((rm 49; sort ) < 49)
<((rm 50; sort ) < 50) | cat)
<(sort -m
<((rm 51; sort ) < 51)
<((rm 52; sort ) < 52) | cat) | cat)
<(sort -m
<(sort -m
<((rm 53; sort ) < 53)
<((rm 54; sort ) < 54) | cat)
<(sort -m
<((rm 55; sort ) < 55)
<((rm 56; sort ) < 56) | cat) | cat) | cat)
<(sort -m
<(sort -m
<(sort -m
<((rm 57; sort ) < 57)
<((rm 58; sort ) < 58) | cat)
<(sort -m
<((rm 59; sort ) < 59)
<((rm 60; sort ) < 60) | cat) | cat)
<(sort -m
<(sort -m
<((rm 61; sort ) < 61)
<((rm 62; sort ) < 62) | cat)
<(sort -m
<((rm 63; sort ) < 63)
<((rm 64; sort ) < 64) | cat) | cat) | cat) | cat) | cat) | cat | md5sum

The only difference is that in withcat every sort -m is piped to cat.

The "useless use of cat"-people will make you believe withcat would be slower than nocat. However, the opposite is true by a wide margin:

$ time bash nocat
c933d81faea7b8dec8eb64ca0b044d74  -

real 3m40.854s user 2m48.687s sys 0m49.135s $ time bash withcat c933d81faea7b8dec8eb64ca0b044d74 -

real 2m21.812s user 2m16.651s sys 1m36.135s

The test is run on a 64-core machine, that does nothing else. Everything is in RAM (so this is not due to slow disks). Every test was run 3 times, and the best time is given above. All three tests completed within 5 seconds of the best time (so it is not a fluke).

Why is it faster to pipe the output to cat?

Edit

Does cat group input in bigger chunks? And/or does sort flush output for every line?

To test this I tried:

$ strace -ff sort -m <(sort 1) <(sort 2) 2>fromsort | cat >/dev/null
$ strace -ff sort -m <(sort 1 | cat ) <(sort 2 | cat) 2>fromcat | cat >/dev/null

If cat made it into bigger chunks, we would expect read to return larger chunks. But it does not:

$ grep -E 'read|write' fromsort |field 1,5|sort | uniq -c 
      1 openat(AT_FDCWD,        3
      8 pread64(3,      =
      1 read(3, 3771
  40989 read(3, 4096
      2 read(3, 832
      1 read(3, unknown
      1 read(4, 0
      1 read(4, 2241
  40959 read(4, 4096
      1 write(1,        1916
  81949 write(1,        4096
$ grep -E 'read|write' fromcat |field 1,5|sort | uniq -c 
      1 openat(AT_FDCWD,        3
      8 pread64(3,      =
      1 read(3, 3771
  40989 read(3, 4096
      2 read(3, 832
      1 read(3, unknown
      1 read(4, 2241
  40959 read(4, 4096
      1 read(4, unknown
      1 write(1,        1916
  81949 write(1,        4096

In both cases both read and write are 4K.

(Incidentally, sort does read (much) bigger chunks if reading from a file and not from a pipe, but that is not the case here).

Edit 2

The goal of the above is to demonstrate that an additional cat is not always useless; and to figure out what causes this.

The goal is not to sort data.

But if your goal was to sort data, why not just use sort's built-in --parallel?

By default sort seems to use --parallel 8 on a 64-core machine. top shows it uses up to 800% CPU. You can force it to use 64 cores with --parallel 64:

$ time sort {1..64} | md5sum
real    9m4.005s
user    29m56.454s
sys     5m49.560s

$ time sort --parallel 64 {1..64} | md5sum real 6m50.332s user 35m55.040s sys 11m37.609s

So GNU sort's --parallel is much slower than the above. The above is now available as parsort: http://git.savannah.gnu.org/cgit/parallel.git/tree/src/parsort

Ole Tange
  • 35,514
  • You would have more luck if you tried with a less example. For starting sort -m fileA fileB is alternating blocking reads from fileA with those from fileB and adding another buffering layer (which a UUoC is doing, by putting another pipe in between) may cause it to always have what to read from the pipe, and not have to wait for the producer to put it there. But that is simply a fluke; the shells process substs and general purpose tools like sort are not appropriate for taking advantage of multiprocessing machines (even of my "4-core" phone, much less of a 64-core monster) –  Oct 13 '20 at 06:06
  • Aside from the cat issue, according to the docs, at least GNU sort should run parallel processes by default, and there's an option to change the number of processes it uses. Doesn't that built-in parallelising help? Or in other words, does it suck so bad that it's useful to do the parallelization by hand? – ilkkachu Oct 13 '20 at 11:05
  • 1
    @ilkkachu The built-in parallelization does not scale to 64 cores (I think it scales to 4 or 8 cores), and it does not work at all when doing sort -m. The inner sorts could use the built-in parallelization, but they only run at 100% for ~15 secs. So: Yes, it sucks so bad. – Ole Tange Oct 13 '20 at 16:58
  • Could you include enough instructions for someone else to replicate the results you're seeing? You say that the files are 160MB each, but don't indicate what's in them... presumably it's some amount of text, divided into sortable lines? – JigglyNaga Oct 13 '20 at 18:47
  • Also, why remove all the files? Do you still see the improvement if you replace all (rm n; sort) < n with sort n? You mention running it multiple times and picking the best, but that gets complicated when you have to copy the files all over again before each run. – JigglyNaga Oct 13 '20 at 18:54
  • @JigglyNaga Instructions to generate files that give similar timings included now. Why would it get complicated to copy the files all over again? They are in RAM. It takes less than 2 seconds to do if copied in parallel. The files are removed because they are tmp-files. – Ole Tange Oct 13 '20 at 19:44
  • 1
    "sort, delete, copy" is more complicated than "sort". Yes, temp files should be removed when they're no longer needed -- but in this situation, you're investigating performance problems, so they're still needed for the next run. Why include extra, unnecessary steps, instead of showing a Minimal Working Example? – JigglyNaga Oct 13 '20 at 21:56
  • Your edit #1 isn't the test I suggested and is inconclusive for a couple of reasons: it shortened the chain of sort commands from 6 deep to 2 reducing any compound copping that may occur and allowing a multicore CPU to run the whole thing with every process active at once. Secondly it doesn't strace the sub-shells which may be enlightening. However there's another effect of caching that wouldn't show on an strace (see my edited answer). – Philip Couling Oct 13 '20 at 23:33
  • correction 7 deep – Philip Couling Oct 13 '20 at 23:48

2 Answers2

4

This is by no means a "useless use of cat".

some_command | cat | some_command

This isn't a traditional "useless use of cat" which are usually derived from ignorance of the shell. Instead this appears to be a deliberate attempt to do something using the dynamics of cat. In this case I believe it's caching.


My Second thoughts

Even if the size of read and write isn't any different there are a couple of things which might be undetectable which could also be in play.

Firstly (and this is very important): Why is processing a sorted array faster than processing an unsorted array?. If you do anything to change the sequence in which the CPU is processing this, the timing may change. If cat succeeds in making each sort run for longer without suspending (and switching to a different process) then this could dramatically affect the CPU's branch prediction and result in a much larger or smaller time.

Secondly, even if the number and size of read is left unaffected number of times a task has to suspend (block) may be different. This in itself is likely to add or remove an overhead. So even if the reads and writes are of the same size, the cat (caching) layer might be reducing the number of times each read() and write() occurs.

Cat might simply be forcing sort to wait longer and thus have more available to do without suspending and reducing the number of times each process blocks. This would be very difficult to detect.


My First thoughts

My expectation here would be that if you put both versions in their own script and ran strace -f on each script, you would see fewer read or / write calls in the example with cat. At least, I would expect to see much larger reads at each layer using cat. My expectation for sort would be that it writes single lines and doesn't internally buffer much. Indeed I would expect it to read() in large enough blocks but only write() in single lines. This means it's not well designed for piping to itself.

As laktak points out in his answer, cat reads in blocks of 128KB (see here) but pipes typically only buffer 64KB. If I'm right then when cat is suspended waiting for a read() to complete this will give a large (128 + 64 KB) buffer for the writing sort operation to write into without ever needing to suspend. By the time cat is resumed there will be a good chunk of data (much more than sort sent in a single write) to pass onto the next sort. As a result the next sort can read from this quite a lot without being suspended.

I also suspect that adding a layer of cat closest the files would have next to no impact or negative impact on performance. These files are already cached on your ram disk. But the layers in-between calls to sort will be acting as a buffer and should reduce the number. That is the truly "useless uses of cat" are those which use cat to read from a file. That's those of the form:

cat some_file | some_command

An interesting experiment

I would be interested to know if the same effect can be induced by increasing the buffer size on the pipes. If you setup the same pipeline from a proper programming language (not a shell). Eg in C you could create your pipeline using pipe(),dup2(),fork(),exec() and call ioctl() on each pipe first to raise the buffer size (see Pipe Capacity)

1

My guess is that your use of cat throttles the throughput of each individual command which in turn lets them run faster in parallel.

cat reads your data in chunks of 128KB. Since I have no way to reproduce your test could you try to replace your usage of cat with dd to prove me right or wrong?

dd status=none bs=128K should have the same effect as cat - try to increase/decrease the block size and compare the results.

laktak
  • 5,946
  • Why would it make them run faster in parallel to be throttled? Limiting throughput usually results in slower results. dd gives comparable results to cat. Slightly faster than cat with bs=2K: 2m13s. – Ole Tange Oct 12 '20 at 21:48
  • Because then one process can't hog all the resouces - e.g. the other sorts would be waiting for input and doing nothing while one process would saturate the io. – laktak Oct 12 '20 at 21:57
  • What I/O are you talking about: This is a RAM disk. I can copy the 64 files (=10 GB) in less than 2 seconds if done in parallel. Also the inner sorts will be done sorting before they start sending data to the 63 remaining sorts. – Ole Tange Oct 12 '20 at 22:09
  • When the inner sorts are done after around 15 secs, the CPU utilization drops to around 20%: 1 process at 100%, 2 at 50%, 4 at 25% and so on. – Ole Tange Oct 12 '20 at 23:29
  • Yes, that's what I meant. The distribution of the resources between the processes isn't fair which is preventing full cpu utilization. – laktak Oct 13 '20 at 13:49
  • But there is ~80% free CPU resources. In other words: Apart from the first 15 secs, the processes get all the CPU they want. – Ole Tange Oct 13 '20 at 16:52