
I use a Linux 4.x-based distribution, and I've recently noticed the kernel's open() system call supports an O_PATH open flag.

While the man page for it does have a list of system calls it could theoretically be used with, I don't quite understand what the idea is. Do I open(O_PATH) only directories, rather than files? And if I do, why do I want to use a file descriptor instead of the directory's path? Also, most of the system calls listed there don't seem to be particular to directories; so, do I also open regular files with O_PATH to somehow get their directory as a file descriptor? Or to get a file descriptor for them but with limited functionality?

Can someone give a cogent explanation of what O_PATH is about and how, and what for, we're supposed to use it?

Notes:

  • No need to describe the history of how this evolved (the relevant man pages mention changes in Linux 2.6.x, 3.5 and 3.6) unless necessary - I just care about how things are now.
  • Please don't tell me to just use libc or other higher-level facilities, I know that.
einpoklum

1 Answer


The description in the open(2) man page gives some clues to start with:

   O_PATH (since Linux 2.6.39)
          Obtain a file descriptor that can be used for two purposes:
          to  indicate  a location in the filesystem tree and to per‐
          form operations that act  purely  at  the  file  descriptor
          level.  The file itself is not opened, and other file oper‐
          ations  (e.g.,  read(2),  write(2),  fchmod(2),  fchown(2),
          fgetxattr(2), ioctl(2), mmap(2)) fail with the error EBADF.

Sometimes, we don't want to open a file or a directory. Instead, we just want a reference to that filesystem object in order to perform certain operations (e.g., to fchdir() to a directory referred to by a file descriptor that we opened using O_PATH). So, a trivial point: if this is our purpose, then opening with O_PATH should be a little cheaper, since the file itself is not actually opened.
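To illustrate both halves of that man page excerpt, here is a minimal sketch in C. The function names are hypothetical, chosen only for this illustration. The first helper shows that read() on an O_PATH descriptor is refused with EBADF; the second shows that the same kind of descriptor is still good enough for fchdir().

```c
#define _GNU_SOURCE             /* needed for O_PATH with glibc */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: opens `path` with O_PATH and tries to read from it.
 * Returns the errno set by read(), 0 if read() unexpectedly succeeds,
 * or -1 if the open() itself fails. */
int read_errno_on_path_fd(const char *path)
{
    char byte;
    int fd = open(path, O_PATH | O_CLOEXEC);
    if (fd == -1)
        return -1;

    ssize_t n = read(fd, &byte, 1);
    int err = (n == -1) ? errno : 0;
    close(fd);
    return err;
}

/* Hypothetical helper: changes the working directory through an O_PATH
 * descriptor. Returns 0 on success, -1 on failure. */
int chdir_via_path_fd(const char *dir)
{
    int fd = open(dir, O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    int rc = fchdir(fd);        /* acts on the location, not on file contents */
    close(fd);
    return rc;
}
```

On a reasonably recent kernel, read_errno_on_path_fd("/") should return EBADF, while chdir_via_path_fd("/") should return 0.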

And a less trivial point: before the existence of O_PATH, the way of obtaining such a reference to a filesystem object was to open the object with O_RDONLY. But the use of O_RDONLY requires that we have read permission on the object. However, there are various use cases where we don't need to actually read the object: for example, executing a binary or accessing a directory (fchdir()) or reaching through a directory to touch an object inside the directory.
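The permission difference can be seen with a small experiment. The sketch below is only an illustration: the helper name and the scratch path under /tmp are assumptions of mine. It creates a directory with no read bit and then tries to open it with whatever flags the caller passes.

```c
#define _GNU_SOURCE             /* needed for O_PATH with glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: creates a scratch directory without read permission
 * and tries to open it with `flags`. Returns 1 if open() succeeded, 0 if it
 * failed, and -1 if the scratch directory could not be set up. */
int can_open_unreadable_dir(int flags)
{
    char tmpl[] = "/tmp/opath-demo-XXXXXX";   /* assumed scratch location */
    if (mkdtemp(tmpl) == NULL)
        return -1;
    if (chmod(tmpl, 0100) == -1) {            /* search bit only, no read bit */
        rmdir(tmpl);
        return -1;
    }

    int fd = open(tmpl, flags | O_DIRECTORY | O_CLOEXEC);
    int ok = (fd != -1);
    if (ok)
        close(fd);
    rmdir(tmpl);
    return ok;
}
```

For an unprivileged user, can_open_unreadable_dir(O_RDONLY) should return 0 (the open fails with EACCES), while can_open_unreadable_dir(O_PATH) should return 1. A privileged process bypasses the permission check, so both calls succeed when run as root.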

Usage with "*at()" system calls

The common, but not the only, use of O_PATH is to open a directory, in order to have a reference to that directory for use with the "*at" system calls, such as openat(), fstatat(), fchownat(), and so on. We can roughly think of this family as the modern successors to the older pathname-based system calls (open(), stat(), chown(), and so on). The family serves a couple of purposes, the first of which you touch on when you ask "why do I want to use a file descriptor instead of the directory's path?". If we look further down in the open(2) man page, we find this text (under a subheading with the rationale for the "*at" system calls):

   First,  openat()  allows  an  application to avoid race conditions
   that could occur when using open() to open  files  in  directories
   other  than  the current working directory.  These race conditions
   result from the fact that some component of the  directory  prefix
   given  to  open()  could  be  changed in parallel with the call to
   open().  Suppose, for example, that we wish  to  create  the  file
   path/to/xxx.dep  if  the  file path/to/xxx exists.  The problem is
   that between the existence check and the file creation step,  path
   or  to  (which might be symbolic links) could be modified to point
   to a different location.  Such races can be avoided by  opening  a
   file descriptor for the target directory, and then specifying that
   file descriptor as the dirfd argument of (say) fstatat(2) and ope‐
   nat().

To make this more concrete... Suppose we have a program that wants to perform multiple operations in a directory other than its current working directory, meaning that we must specify some directory prefix as part of the filenames we use. Suppose, for example, that the pathname is /dir1/dir2/file and we want to perform two operations:

  1. Perform some check on /dir1/dir2/file (e.g., who owns the file, or what time was it last modified).
  2. If we are satisfied with the result of that check, perhaps we then want to do some other filesystem operation in the same directory, for example, creating a file called /dir1/dir2/file.new.

Now, first suppose we did everything using traditional pathname-based system calls:

struct stat statbuf;
stat("/dir1/dir2/file", &statbuf);
if ( /* Info returned in statbuf is to our liking */ ) {
    fd = open("/dir1/dir2/file.new", O_CREAT | O_RDWR, 0600);
    /* And then populate file referred to by fd */
}

Now, furthermore suppose that in the directory prefix /dir1/dir2 one of the components (say dir2) was actually a symbolic link (that refers to a directory), and that between the call to stat() and the call to open() a malicious person was able to change the target of the symbolic link dir2 to point to a different directory. This is a classic time-of-check-time-of-use race condition. Our program checked a file in one directory but was then tricked into creating a file in a different directory -- perhaps a security-sensitive directory. The key point here is that the pathname /dir1/dir2 looked the same, but what it refers to changed completely.

We can avoid these sorts of problems using the "*at" calls. First of all, we obtain a handle referring to the directory where we will do our work:

dirfd = open("/dir1/dir2", O_PATH);

The critical point here is that dirfd is a stable reference to the directory that was referred to by the path /dir1/dir2 at the time of the open() call. If the target of the symbolic link dir2 is subsequently changed, this will not affect what dirfd refers to. Now, we can do our check + operation using the "*at" calls that are equivalent to the stat() and open() calls above:

struct stat statbuf;
fstatat(dirfd, "file", &statbuf, 0);
if ( /* Info returned in statbuf is to our liking */ ) {
    fd = openat(dirfd, "file.new", O_CREAT | O_RDWR, 0600);
    /* And then populate file referred to by fd */
}

During these steps any manipulation of symbolic links in the pathname /dir1/dir2 will have no impact: the check (fstatat()) and the operation (openat()) are guaranteed to take place in the same directory.
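Put together as a compilable function, the pattern might look like the following sketch. The function name create_beside() and the particular check (regular file owned by the caller) are assumptions made for the example; the point is only that the check and the creation both go through the same dirfd.

```c
#define _GNU_SOURCE             /* needed for O_PATH with glibc */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: if `name` inside `dirpath` is a regular file owned by
 * the caller, creates `newname` in that same directory. Returns a descriptor
 * for the new file, or -1 on failure or if the check does not pass. */
int create_beside(const char *dirpath, const char *name, const char *newname)
{
    struct stat statbuf;
    int fd = -1;

    /* Stable reference to whatever directory dirpath names right now. */
    int dirfd = open(dirpath, O_PATH | O_DIRECTORY | O_CLOEXEC);
    if (dirfd == -1)
        return -1;

    /* The check and the operation both resolve relative to dirfd. */
    if (fstatat(dirfd, name, &statbuf, AT_SYMLINK_NOFOLLOW) == 0 &&
        S_ISREG(statbuf.st_mode) && statbuf.st_uid == geteuid()) {
        fd = openat(dirfd, newname,
                    O_CREAT | O_EXCL | O_RDWR | O_CLOEXEC, 0600);
    }

    close(dirfd);
    return fd;
}
```

A call such as create_beside("/dir1/dir2", "file", "file.new") then corresponds to the two steps described earlier. Note that O_EXCL is an extra precaution added in this sketch, so the call fails if file.new already exists.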

There is another purpose to using the "*at()" calls, which relates to the idea of "per-thread current working directories" in multithreaded programs (and again we could open the directories using O_PATH), but I think this use is probably less relevant to your question, and I leave you to read the open(2) man page if you'd like to know more.

Usage with file descriptors for regular files

One usage of O_PATH with regular files is to open a binary for which we have execute permission (but not necessarily read permission, so that we could not open the file with O_RDONLY). That file descriptor can then be passed to fexecve(3) to execute the program. All that fexecve(fd, argv, envp) is doing with its fd argument is essentially:

char buf[64];
snprintf(buf, sizeof(buf), "/proc/self/fd/%d", fd);
execve(buf, argv, envp);

(Although, starting with glibc 2.27, the implementation will instead make use of the execveat(2) system call, on kernels that provide that system call.)
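As a rough sketch of how this could be used from a program, the hypothetical helper below opens an executable with O_PATH, forks, and runs it in the child via fexecve(). The function name and the exit-code convention are my own choices for the example, not part of any library.

```c
#define _GNU_SOURCE             /* needed for O_PATH with glibc */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: runs the program at `path` in a child process using
 * an O_PATH descriptor. Returns the child's exit status, or -1 on failure. */
int run_via_path_fd(const char *path, char *const argv[])
{
    /* No O_CLOEXEC here: if the target were a script, its interpreter
     * would need the descriptor to remain open across the exec. */
    int fd = open(path, O_PATH);
    if (fd == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        fexecve(fd, argv, environ);
        _exit(127);             /* reached only if fexecve() failed */
    }

    close(fd);
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, with char *args[] = { "sh", "-c", "exit 7", NULL }, the call run_via_path_fd("/bin/sh", args) should return 7, assuming /bin/sh exists on the system.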

mtk
  • The problem is that between the existence check and the file creation step, path or to ... could be modified - can't parse this sentence. But I get the gist of it, I think. So it serves as a sort of a locking mechanism on a directory. But why use the open() result rather than an actual lock? – einpoklum Sep 16 '17 at 13:12
  • @einpoklum the problem is that 'path' and 'to' don't have the formatting shown in the original man page. These are components of the hypothetical pathname "/path/to/xxx". And, it's not like a lock: it's a stable reference to a filesystem object; several programs might have such reference to the same object. – mtk Sep 16 '17 at 13:29
  • The directory(/symlink/file) could, in fact, be unlinked (rm/unlink()) while you hold a reference/FD to it. However, the FD is (indirectly) a pointer to the actual blocks on disk, which will continue to be treated as "in-use" until the FD is closed - even though the name referencing them no longer exists. Waiting for lock = not doing anything: better to avoid locks. – DimeCadmium Aug 14 '20 at 01:56