Why does using "|" or "&" produce different sizes of the result file?

Question

Here is my explanation:

strings *.bin > bin.txt | sort -n bin.txt > logs1.txt

That gives me logs1.txt, but it is only 621 KB.

strings *.bin > bin.txt & sort -n bin.txt > logs1.txt

That gives me logs1.txt, but it is 0 KB.

strings *.bin > bin.txt
sort -n bin.txt > logs2.txt

These commands produce the file logs2.txt at 586,853 KB.

Note that bin.txt is also 586,853 KB, meaning only the third option gives me a result the same size as bin.txt. I want to know the reason.

Kusalananda · Answer 1 · 2022-06-17T09:31:05.280

Some of the details in this answer assume that the user uses a shell other than zsh. The details differ slightly for the zsh shell due to its MULTIOS feature.

strings *.bin > bin.txt | sort -n bin.txt > logs1.txt

This runs strings *.bin and redirects the result to bin.txt. At the same time as strings starts, sort is started and sorts the file bin.txt. The pipe has no function at all in this pipeline other than allowing the two commands to run concurrently.

Usually, the pipe is used to transfer the standard output of the left side's command to the standard input of the right side's command, but since both commands read from a file, the pipe is never used.

Since both strings and sort are started concurrently, sort may possibly find the end of the bin.txt file before strings has finished writing the whole file. It's fairly random how much data sort will end up reading, if anything.
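The race can be made visible with a deliberately slow writer. The following is only a sketch, assuming a shell such as bash that waits for every member of a pipeline; `slow_writer` and the file names are hypothetical stand-ins for `strings` and `bin.txt`.

```shell
# Scratch directory so the demonstration does not touch real files.
dir=$(mktemp -d)

# Hypothetical slow writer standing in for `strings`:
# one line immediately, two more after a pause.
slow_writer() {
    printf '3\n'
    sleep 2
    printf '1\n2\n'
}

# Same shape as the command in the question: the writer's output goes to a
# file, so nothing travels through the pipe, and sort reads the file while
# it is still being written (or before it even exists).
slow_writer > "$dir/demo.txt" | sort -n "$dir/demo.txt" > "$dir/early.txt" 2>/dev/null

# Count what sort managed to read versus what was eventually written.
seen=$(( $(wc -l < "$dir/early.txt") ))
total=$(( $(wc -l < "$dir/demo.txt") ))
printf 'sort saw %s of %s lines\n' "$seen" "$total"

rm -r "$dir"
```

Depending on timing, sort sees either nothing or only the first line, but never all three, because it reaches the end of the file long before the writer finishes.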

Correct use of the pipe would have looked like
```
strings -- *.bin | sort -n > logs1.txt
```
Here, strings writes directly to the input of sort rather than to a file, and sort reads from the output of strings instead of from a file.

The right-hand side of the pipe will be temporarily blocked if the left-hand side does not produce data fast enough, and the left-hand side will be temporarily blocked if the right-hand side can't consume the data fast enough. This way, the two utilities are synchronized, and you are guaranteed that sort will read the full output of strings.
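As a small illustration of that synchronization, the sketch below pipes a deliberately slow producer into sort. The `slow_producer` function is a hypothetical stand-in for `strings`, not part of the original commands.

```shell
# Hypothetical slow producer standing in for `strings`:
# emits one number per second, out of order.
slow_producer() {
    for n in 3 1 2; do
        printf '%s\n' "$n"
        sleep 1
    done
}

# sort blocks on the pipe until the producer closes its end,
# so it receives all three lines no matter how slowly they arrive.
sorted=$(slow_producer | sort -n)
printf '%s\n' "$sorted"
```

Here sort cannot finish early: it only sees end-of-input once the producer has exited and the pipe is closed.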

strings *.bin > bin.txt & sort -n bin.txt > logs1.txt

This suffers from the same issue as the previous command as both strings and sort are started concurrently. The & starts strings in the background and sort is started immediately after that. Both utilities write to or read from bin.txt independently of each other, and it's chance that decides how much of the file is written before sort reaches its end.

strings *.bin > bin.txt followed by sort -n bin.txt > logs2.txt.

Here, you manually synchronize the two utilities by allowing strings to finish writing to the intermediate file bin.txt before using sort to sort its contents. There is no issue, and you are guaranteed that sort will be able to read the complete output of strings from the file.
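The same sequencing can be written on one line. This is a minimal sketch with hypothetical file names and `printf` standing in for `strings`; it uses `&&`, which runs the second command only after the first has finished successfully.

```shell
# Scratch directory so the demonstration does not touch real files.
dir=$(mktemp -d)

# The writer runs to completion first; sort starts only afterwards,
# and only if the writer succeeded.
printf '3\n1\n2\n' > "$dir/unsorted.txt" && sort -n "$dir/unsorted.txt" > "$dir/sorted.txt"

sorted=$(cat "$dir/sorted.txt")
printf '%s\n' "$sorted"

rm -r "$dir"
```

Using `;` or a newline between the two commands gives the same ordering; `&&` merely skips the sort if the first command fails.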

Summary: Your first two commands do not synchronize the strings and sort utilities. The writing by strings is independent of the reading by sort. This means sort may find the end of the intermediate file before strings is finished writing all data. This, in turn, means you may get an incomplete end result. The amount of data your incomplete result will contain depends on chance.

The fact that the two utilities are started concurrently with each other also means that sort may even read a pre-existing bin.txt to the end before the shell even has time to truncate the file and start strings.

Solution: Write all data to the intermediate file first, then read from it, as in your third example. Or, allow the two utilities to communicate the data directly between themselves using the pipe, as in my suggestion above:

strings -- *.bin | sort -n > logs1.txt

Or to keep a copy of the unsorted strings output for future reference:

strings -- *.bin | tee bin.txt | sort -n > logs1.txt
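To see what the tee variant leaves behind, here is a small sketch with `printf` standing in for `strings` and hypothetical file names:

```shell
# Scratch directory so the demonstration does not touch real files.
dir=$(mktemp -d)

# tee writes each chunk of its input both to the named file and to the pipe,
# so the file keeps the original order while sort receives the same data.
printf '3\n1\n2\n' | tee "$dir/copy.txt" | sort -n > "$dir/sorted.txt"

copy=$(cat "$dir/copy.txt")
sorted=$(cat "$dir/sorted.txt")
printf 'unsorted copy:\n%s\nsorted:\n%s\n' "$copy" "$sorted"

rm -r "$dir"
```

Both files end up complete, because sort still reads from the pipe and only finishes once tee has written everything and exited.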


Re. "but since both commands read from a file, the pipe is never used.": it's also because the first command writes to a file. If it were just sort foo.txt | sort bar.txt, the pipe would still be written to, and if foo.txt were large enough, the pipeline would hang since nothing would be reading from the pipe (small amounts of data would just disappear into the pipe buffers). - ilkkachu, Jun 17 '22 at 08:50
