
Say I have the following script:

#!/bin/bash
for i in $(seq 1000)
do
    cp /etc/passwd tmp
    cat tmp | head -1 | head -1 | head -1 > tmp  #this is the key line
    cat tmp
done

On the key line, I read and write the same file tmp which sometimes fails.

(I read that this is because of a race condition: the processes in the pipeline are executed in parallel. I do not understand why - each head needs to take its data from the previous one, doesn't it? This is NOT my main question, but you can answer it as well.)

When I run the script, it outputs about 200 lines. Is there any way I can force this script to output always 0 lines (so the I/O redirection to tmp is always prepared first and so the data is always destroyed)? To be clear, I mean changing the system settings, not this script.

Thanks for your ideas.

karlosss

2 Answers


Why there is a race condition

The two sides of a pipe are executed in parallel, not one after the other. There's a very simple way to demonstrate this: run

time sleep 1 | sleep 1

This takes one second, not two.

The shell starts two child processes and waits for both of them to complete. These two processes execute in parallel: the only reason why one of them would synchronize with the other is when it needs to wait for the other. The most common point of synchronization is when the right-hand side blocks waiting for data to read on its standard input, and becomes unblocked when the left-hand side writes more data. The converse can also happen, when the right-hand side is slow to read data and the left-hand side blocks in its write operation until the right-hand side reads more data (there is a buffer in the pipe itself, managed by the kernel, but it has a small maximum size).
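The pipe-buffer case is easy to observe, too. In this sketch (64 KiB is the default pipe capacity on Linux; other systems differ), the writer has more data than the buffer can hold, so it blocks until the delayed reader starts draining:

```shell
# A payload larger than the pipe buffer forces the writer to block:
# "writer unblocked" only appears once the reader wakes up and starts draining.
sh -c '{ head -c 200000 /dev/zero; echo "writer unblocked" >&2; } |
       { sleep 1; cat > /dev/null; }'
# With a small payload (e.g. head -c 1000) the message appears immediately.
```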

To observe a point of synchronization, try the following commands (sh -x prints each command as it executes it):

time sh -x -c '{ sleep 1; echo a; } | { cat; }'
time sh -x -c '{ echo a; sleep 1; } | { cat; }'
time sh -x -c '{ echo a; sleep 1; } | { sleep 1; cat; }'
time sh -x -c '{ sleep 2; echo a; } | { cat; sleep 1; }'

Play with variations until you're comfortable with what you observe.

Given the compound command

cat tmp | head -1 > tmp

the left-hand process does the following (I've only listed steps that are relevant to my explanation):

  1. Execute the external program cat with the argument tmp.
  2. Open tmp for reading.
  3. While it hasn't reached the end of the file, read a chunk from the file and write it to standard output.

The right-hand process does the following:

  1. Redirect standard output to tmp, truncating the file in the process.
  2. Execute the external program head with the argument -1.
  3. Read one line from standard input and write it to standard output.

The only point of synchronization is that right-3 waits for left-3 to have processed one full line. There is no synchronization between left-2 and right-1, so they can happen in either order. What order they happen in is not predictable: it depends on the CPU architecture, on the shell, on the kernel, on which cores the processes happen to be scheduled, on what interrupts the CPU receives around that time, etc.
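You can watch this unpredictability directly. This sketch re-runs the racy line in a scratch directory and counts how often cat managed to read the old contents before the truncation; the count varies from run to run and machine to machine:

```shell
# Re-run the racy line and count how often cat wins the race against the
# truncation. The result is unpredictable: it varies per run and per machine.
cd "$(mktemp -d)" || exit 1
wins=0
for i in $(seq 100); do
    cp /etc/passwd tmp
    cat tmp | head -1 > tmp
    [ -s tmp ] && wins=$((wins + 1))   # non-empty tmp: cat read data before the truncation
done
echo "cat won $wins out of 100 races"
```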

How to change the behavior

You cannot change the behavior by changing a system setting. The computer does what you tell it to do. You told it to truncate tmp and read from tmp in parallel, so it does the two things in parallel.

Ok, there is one “system setting” you could change: you could replace /bin/bash by a different program that is not bash. I hope it would go without saying that this is not a good idea.

If you want the truncation to happen before the left-hand side of the pipe, you need to put it outside of the pipeline, for example:

{ cat tmp | head -1; } >tmp

or

( exec >tmp; cat tmp | head -1 )
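A quick sanity check of the truncate-first form, in a scratch directory (either variant behaves the same way):

```shell
cd "$(mktemp -d)" || exit 1
printf 'line1\nline2\n' > tmp
{ cat tmp | head -1; } > tmp   # the redirection truncates tmp before cat opens it
wc -c < tmp                    # always 0 bytes: cat only ever sees an empty file
```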

I have no idea why you'd want this though. What's the point in reading from a file that you know to be empty?

Conversely, if you want the output redirection (including the truncation) to happen after cat has finished reading, then you need to either fully buffer the data in memory, e.g.

line=$(cat tmp | head -1)   # note: command substitution strips the trailing newline
printf %s "$line" >tmp

or write to a different file and then move it into place. This is usually the robust way to do things in scripts, and has the advantage that the file is written in full before it's visible through the original name.

cat tmp | head -1 >new && mv new tmp
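A quick check of the write-then-rename approach in a scratch directory:

```shell
cd "$(mktemp -d)" || exit 1
printf 'first\nsecond\n' > tmp
cat tmp | head -1 > new && mv new tmp   # tmp is replaced only once new is complete
cat tmp                                 # prints: first
```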

The moreutils collection includes a program that does just that, called sponge.

cat tmp | head -1 | sponge tmp

How to detect the issue automatically

If your goal was to take badly-written scripts and automatically figure out where they break, then sorry, life isn't that simple. Runtime analysis won't reliably find the problem, because sometimes cat finishes reading before the truncation happens. Static analysis can do it in principle; the simplified example in your question is caught by ShellCheck (as SC2094), but it may not catch a similar problem in a more complex script.

Gilles 'SO- stop being evil'
  • That was my goal, to determine whether the script is well written or not. If the script might destroy data this way, I just wanted it to destroy the data every time. It is not good to hear that this is nearly impossible. Thanks to you, I now know what the problem is and will try to think up a solution. – karlosss Dec 09 '17 at 12:35
  • @karlosss: Hmm, I wonder if you could use the same system-call tracing / intercepting stuff as strace (i.e. Linux ptrace) to make all open-for-reading system calls (in all child processes) sleep for half a second, so when racing with a truncation, the truncation will almost always win. – Peter Cordes Dec 09 '17 at 14:28
  • @PeterCordes i am a newbie to this, if you can manage a way to achieve this and write it up as an answer, I will accept it. – karlosss Dec 09 '17 at 14:39
  • @PeterCordes You can't guarantee that the truncation will win with a delay. It'll work most of the time, but occasionally on a heavily loaded machine your script will fail in more or less mysterious ways. – Gilles 'SO- stop being evil' Dec 09 '17 at 20:32
  • @Gilles: Let's discuss this under my answer. – Peter Cordes Dec 09 '17 at 20:33

Gilles' answer explains the race condition. I'm just going to answer this part:

Is there any way I can force this script to output always 0 lines (so the I/O redirection to tmp is always prepared first and so the data is always destroyed)? To be clear, I mean changing the system settings

IDK if a tool for this already exists, but I have an idea for how one could be implemented. (Note this wouldn't always be 0 lines, just a useful tester that easily catches simple races like this one, and some more complicated races. See @Gilles' comments.) It wouldn't guarantee that a script was safe, but it might be a useful tool in testing, similar to testing a multi-threaded program on different CPUs, including weakly-ordered non-x86 CPUs like ARM.

You'd run it as racechecker bash foo.sh

Use the same system-call tracing / intercepting facilities that strace -f and ltrace -f use to attach to every child process. (On Linux, this is the same ptrace system call used by GDB and other debuggers to set breakpoints, single step, and modify memory / registers of another process.)

Instrument the open and openat system calls: when any process running under this tool makes an open(2) (or openat) system call with O_RDONLY, sleep for maybe 1/2 or 1 second. Let other open system calls (especially ones including O_TRUNC) execute without delay.

This should let the writer win the race in nearly every race condition, unless the system load is also high, or it's a complicated race where the truncation doesn't happen until after some other read. Randomly varying which open()s (and maybe read()s or write()s) get delayed would increase the tool's detection power. But short of testing for an infinite amount of time with a delay simulator that eventually covers every situation you could encounter in the real world, you can't be sure your scripts are free of races unless you read them carefully and prove they aren't.
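For this one pipeline, the intended bias can be simulated by hand: delay the open-for-read side and the truncation (nearly) always wins. This is just a sketch of the effect the tool would create, not the ptrace tool itself, and on an extremely loaded machine even a 1-second head start isn't a guarantee:

```shell
cd "$(mktemp -d)" || exit 1
cp /etc/passwd tmp
{ sleep 1; cat tmp; } | head -1 > tmp   # the reader is delayed, so the truncation wins
wc -c < tmp                             # 0 bytes: the data was reliably destroyed
```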


You would probably need it to whitelist (i.e. not delay opens of) files in /usr/bin and /usr/lib so that process startup doesn't take forever. (Runtime dynamic linking has to open() multiple files; look at strace -eopen /bin/true or /bin/ls sometime.) If the parent shell itself does the truncation, that case is fine anyway, but it's still good for the tool not to make scripts unreasonably slow.

Or maybe whitelist every file the calling process doesn't have permission to truncate in the first place: the tracing process can make an access(2) system call before actually suspending a process that wants to open() a file.


racechecker itself would have to be written in C, not in shell, but it could maybe use strace's code as a starting point and might not take much work to implement.

You could maybe get the same functionality with a FUSE filesystem. There's probably a FUSE example of a pure passthrough filesystem, to which you could add checks in the open() function that make it sleep for read-only opens but let truncations happen right away.

Peter Cordes
  • Your idea for a race checker doesn't really work. First, there's the issue that timeouts are not reliable: one day the other guy will take longer than you expect (it's a classic issue with build or test scripts, which seem to work for a while and then fail in hard-to-debug ways when the workload expands and many things run in parallel). But beyond this, which open are you going to add a delay to? In order to detect anything interesting, you'd need to make many runs with different delay patterns and compare their results. – Gilles 'SO- stop being evil' Dec 09 '17 at 20:38
  • @Gilles: Right, any reasonably short delay doesn't guarantee the truncation will win the race (on a heavily loaded machine, as you point out). The idea here is that you use this to test your script a few times, not that you use racechecker all the time. And you'd probably want the open-for-read sleep time to be configurable, for the benefit of people on very heavily loaded machines who want to set it higher, like 10 seconds, or lower, like 0.1 seconds for long or inefficient scripts that re-open files a lot. – Peter Cordes Dec 09 '17 at 20:38
  • @Gilles: Great idea about different delay patterns, that might let you catch more races than just the simple within-the-same-pipeline stuff that "should be obvious (once you know how shells work)" like the OP's case. But "which opens?" any read-only open, with a whitelist or some other way to not delay process startup. – Peter Cordes Dec 09 '17 at 20:44
  • I guess you're thinking of more complex races with background jobs that don't truncate until after some other process completes? Yeah, random variation might be needed to catch that. Or maybe look at the process tree and delay "early" reads more, to try to invert the usual ordering. You could make the tool more and more complicated to simulate more and more reordering possibilities, but at some point you still have to design your programs correctly if you're doing multi-tasking. Automated testing could be useful for simpler scripts where the possible problems are more limited. – Peter Cordes Dec 09 '17 at 20:45
  • It's pretty similar to testing multi-threaded code, especially lockless algorithms: logical reasoning about why it's correct is very important, as well as testing, because you can't count on testing on any particular set of machines to produce all the reorderings that might be a problem if you haven't closed all the loopholes. But just like testing on a weakly-ordered architecture like ARM or PowerPC is a good idea in practice, testing a script under a system that artificially delays things can expose some races, so it's better than nothing. You can always introduce bugs it won't catch! – Peter Cordes Dec 09 '17 at 20:52
  • @Gilles: thanks for the feedback. I've updated the answer to be a lot less naive about how useful it would be for real-world difficult races. It's only easy to catch simple races in non-parallel scripts (which is mostly what I was thinking about, due to the simple example in the question). – Peter Cordes Dec 09 '17 at 21:03