3

Question

Suppose I have some non-directory (file, named pipe/socket, whatever) at the pathname /tmp/foo and some other non-directory at the pathname /tmp/bar. Then two (or more) processes start executing concurrently:

Process one does:

unlink('/tmp/foo') /* or rename('/tmp/foo', '/tmp/removed') */
unlink('/tmp/bar') /* or rename('/tmp/bar', '/tmp/removed') */

Process two (and so on) does:

link('/tmp/foo', '/tmp/bar')

As I understand it, there is no way process two could possibly succeed (either the link(2) is attempted while /tmp/foo is still present, in which case /tmp/bar is also present so it must fail with EEXIST, or /tmp/foo is gone so is must fail with ENOENT).

But this intuition relies on the assumption that the unlink(2) and/or rename(2) system calls are inherently sequential in their unlinking effects, so I am looking for verification of my understanding: Is there any *nix-like system out there whose kernel allows the two unlink(2) and/or rename(2) calls to succeed, but simultaneously causes link(2) to succeed as well (whether due to re-order the unlinking of /tmp/foo and /tmp/bar and not abstracting/hiding that from the process calling link(2), or through through some other quirky race condition/bug)?

Current Understanding

I have read the manpages for unlink(2), rename(2), and link(2) for Linux and a few BSDs, and the POSIX specification for these functions. But I don't think they actually contain anything reassuring on this matter, upon careful consideration. At least with rename(2), we're promised that the destination is atomically replaced if it's already present (bugs in the OS itself aside), but nothing else.

I have seen claims that multiple simultaneous executions of rename(foo, qux) will atomically and portably have all but one rename fail with ENOENT - so that's promising! I am just uncertain if that can be extended to having a link(foo, bar) fail with ENOENT under the same circumstances as well.

Preferred Answers

I realize that this is one of those "can't prove a negative" situations - we can at best only note that there is no evidence that a *nix-like system which will allow process two's link(2) to succeed exists.

So what I'm looking for is answers covering as many *nix-like systems as possible (at least Linux, OS X, and the various BSDs, but ideally also the proprietary still-in-some-use systems like Solaris 10) - from people who have sufficient familiarity with these systems and this narrow set of problems (atomic/well-ordered file system operations) that they're confident (as much as one realistically can be) that they'd know of issues like the aforementioned Mac OS X rename(2)-not-actually-atomic bug if they existed on the platforms they're familiar with. That would give me enough confidence that this works the way I think it does in a portable-enough manner to rely on.

Final Note

This isn't an "X/Y problem" question - there's no underlying problem that can be answered by referring me to the various locking/IPC mechanisms or something else that works around the uncertainty about how these particular system calls interact: I specifically want to know if one can rely on the above system calls portably interacting as expected across *nix-like systems in practical use today.

mtraceur
  • 1,166
  • 9
  • 14
  • 1
    I don't have vast knowlege of the problem, but I think the answer partially depends on the I/O scheduler. While noop scheduler really performs those operations in the order they come, the default CFQ scheduler is way more creative and you can't assume to guess what order it will apply. – Barafu Albino Oct 02 '16 at 21:05
  • 1
    @BarafuAlbino No, this has nothing to do with the I/O scheduler. The I/O scheduler has an impact on resilience, but not on atomicity. The I/O scheduler's behavior is not observable by applications except in terms of performance. – Gilles 'SO- stop being evil' Oct 02 '16 at 21:17
  • 1
    "...assumption that the system calls are inherently sequential..." -- those system calls do one thing only, each. You're doing two system calls which are going to be handled in order (otherwise it would be very easy to come up with sequences that fail). Of course what gets written on disk might be different, but you rarely get portable guarantees about behaviour after a system crash. – ilkkachu Oct 02 '16 at 21:19
  • 1
    (Though, sure, doing unlink(foo) + unlink(bar) would leave the ordering of the system calls up to the the compiler..) – ilkkachu Oct 02 '16 at 21:22
  • 1
    @BarafuAlbino: Thanks - I guess it's worth clarifying that in that case what matters is whether the kernel "hides" the implementation detail of actual I/O re-ordering from processes that are simultaneously calling link(2). Since when two processes call rename(2) on the same source path, the kernel can keep track and causes one of them to fail with ENOENT, even if the I/O for the operation isn't scheduled to be done until later (and most *nix-likes seem to do just that - it's just a matter of whether that logical consistency extends to unlink(2) and link(2)). – mtraceur Oct 02 '16 at 21:22
  • @Gilles and @ilkkachu: Thanks - it's reassuring that it is indeed sensible to expect that userspace should be able to expect the effects to appear to happen in the order that the system calls succeed. One scenario I was envisioning that worried me was if some *nix-like kernel was implemented in a way that started handling the link, checked that foo existed, then took an interrupt, context switched and handled the two unlinks from the other process, then got around to finishing handling the link where it left off - but I'm likely overly paranoid no sane kernel would do such a thing. – mtraceur Oct 02 '16 at 21:32

1 Answers1

5

Look at standards such as POSIX for portability guarantees. In practice, most POSIX-compliant systems have minor deviations from the specifications, but generally speaking you can rely on the guarantees given in the specification. Most modern unices comply with the specification even if they haven't been formally tested. They may need to be run in a POSIX mode, e.g. setting POSIXLY_CORRECT=1 with bash or making sure that /usr/xpg4/bin is ahead of /bin and /usr/bin in PATH on Solaris.

Single Unix v2 (an older extension of POSIX) has this to say about link:

The link() function will atomically create a new link for the existing file and the link count of the file is incremented by one.

About rename:

If the link named by the new argument exists, it is removed and old renamed to new. In this case, a link named new will remain visible to other processes throughout the renaming operation and will refer either to the file referred to by new or old before the operation began.

POSIX explicitly states that if the destination exists, its replacement must be atomic. It does not however state that the renaming itself must be atomic, i.e. that there is no point in time when both old and new refer to the file in question, or when neither does. In practice, those properties are true on unix systems, at least with local filesystems.

Furthermore the order of operations is guaranteed: in C, ; guarantees sequential execution; in sh, ;/newline guarantees sequential execution (as do && and so on); other programming languages offer similar guarantees. So in

unlink("/tmp/foo");
unlink("/tmp/bar");

it is guaranteed that there is no point in time when /tmp/foo exists but not /tmp/bar (assuming that /tmp/bar exists initially). Therefore a concurrent process executing link("/tmp/foo", "/tmp/bar") cannot succeed.

Note that atomicity does not guarantee resilience. Atomicity is about observable behavior on a live system. Resilience, in the context of filesystems, is about what happens in case of a system crash. Many filesystems sacrifice resilience for performance, so if the execution of unlink("foo"); unlink("bar"); is interrupted (with the current directory on on-disk storage), it is possible that bar will be deleted and foo will remain behind.

Some network filesystems give fewer guarantees when operations happen on different clients. Older implementations of NFS were notorious for this. I think modern implementations are better but I have no experience of modern NFS.

  • I should have noted that I did read the POSIX pages as well. Note that POSIX unlink page has no mention of atomicity - I see nothing that would prohibit an implementation from atomically creating a link (in the sense that either the link is made or it's not) requested with link(foo, bar) even a significant delay after another process successfully called unlink(foo), especially if the file was simultaneously open somewhere. That would be the right, sensible thing to do, but that doesn't preclude bad implementations. – mtraceur Oct 02 '16 at 21:49
  • Also, I do have reasonable confidence that POSIX is mostly a reliable measuring stick of system behavior, but as the atomicity bug in Mac OS X's rename(2) that I referred to in my question shows, sometimes broken systems arise, so in some way this question is a call for anyone who has run into similar errors (as they pertain to unlink/link/rename interaction) to note them. Still, a +1 for the answer: my comments aside, it's still a helpful answer. – mtraceur Oct 02 '16 at 21:55
  • @mtraceur unlink only does one thing. Atomicity means doing several things at the same time; with only one thing, it's moot. – Gilles 'SO- stop being evil' Oct 02 '16 at 22:25
  • 1
    Well, atomicity means being effectively indivisible - so things that are just "one thing" in some sense are perhaps implicitly considered atomic. But at what level does unlink stop being considered "one thing"? Anyway, Cygwin is actually a great counter-example: not a *nix-like system, but it strives for POSIX compliance as I understand it - yet their implementation of unlink will actually fake-succeed and queue the unlink operating until the thing being unlinked is closed - sure it's a workaround for a Windows limitation, but is it actually a violation of POSIX? – mtraceur Oct 03 '16 at 00:24
  • 1
    By the way, what I meant by "at what level does unlink stop being [...] 'one thing'" is that at the end of the day, unlink involves multiple machine operations - file-system lookup (possibly cached in memory, possibly involving disk IO), changing filesystem metadata (both cached and on disk, not necessarily at the same time, and I'm not sure that's always just one memory operation and just one disk IO operation), and all of this is possibly interrupted/preempted by other kernel tasks and later resumed along the way. – mtraceur Mar 19 '17 at 05:47