As far as I understand the container terminology, what I'm essentially trying to accomplish is to write my own "container runtime".

What I'm doing:

user@host:~$ mkdir test
user@host:~$ cd test
user@host:~/test$ mkdir dev
user@host:~/test$ mkdir proc
user@host:~/test$ echo 1 |sudo tee /proc/sys/kernel/unprivileged_userns_clone
user@host:~/test$ unshare --ipc --mount --net --pid --uts --cgroup --user \
  --map-root-user --fork bash
root@host:~/test# mount none -t tmpfs dev/
root@host:~/test# touch dev/zero
root@host:~/test# mount /dev/zero -o bind dev/zero
root@host:~/test# echo 1 > dev/zero
bash: dev/zero: Permission denied
root@host:~/test# ls -lah dev
total 4.0K
drwxrwxrwt 2 root   root      60 Sep  1 15:12 .
drwxr-xr-x 3 root   root    4.0K Sep  1 13:47 ..
crw-rw-rw- 1 nobody nogroup 1, 5 Sep  1 13:55 zero
root@host:~/test# mount # we are still looking at hosts /proc
<...>
none on /home/user/test/dev type tmpfs (rw,relatime,uid=1000,gid=1000)
udev on /home/user/test/dev/zero type devtmpfs (rw,nosuid,relatime,size=3921088k,nr_inodes=980272,mode=755)
root@host:~/test# mount none -t proc proc/
root@host:~/test# cat proc/mounts
<...>
none /home/user/test/dev tmpfs rw,relatime,uid=1000,gid=1000 0 0
udev /home/user/test/dev/zero devtmpfs rw,nosuid,relatime,size=3921088k,nr_inodes=980272,mode=755 0 0
none /home/user/test/proc proc rw,relatime 0 0

Echoing to dev/zero produces an error. Can anybody enlighten me as to what I'm doing wrong?

I took the idea from Docker's runc (libcontainer): https://github.com/docker/runc/blob/ae2948042b08ad3d6d13cd09f40a50ffff4fc688/libcontainer/rootfs_linux.go#L463

This question might be relevant: -bash: /dev/null: Permission denied

OS: Debian Buster, kernel: 4.19.37

lynx
  • @mosvy: You are right. My bad. Reading does work. I tried to simplify too much (to reading) when I noticed that writing fails, and forgot to replicate the behavior of the example (second mistake). I've updated the question to use writing (I don't think it's worth opening a new one). – lynx Sep 01 '19 at 20:15
  • I don't think it's related, /dev/null is indeed a character device on your system. – 炸鱼薯条德里克 Sep 01 '19 at 23:57

1 Answer

Quick fix for your problem: mount the tmpfs with a mode= option (which sets the permissions of the root dir of the filesystem), and use a mode without either the sticky bit (01000) or the write permission for others (002) -- which are both on by default:

root@host:~/test# mount -o mode=0755 -t tmpfs none dev/

You're not losing anything; having the /dev directory inside your container sticky and writable by everybody wasn't a good idea in the first place (it would be interesting to know if Docker is doing that, too ;-))
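To confirm that a directory no longer carries the problematic combination after remounting, its mode bits can be inspected. The following is a minimal Python sketch, not part of the original fix; the helper name is_sticky_world_writable is made up for illustration, and it only looks at the mode, not at ownership.

```python
import os
import stat


def is_sticky_world_writable(path):
    """Return True if path is a directory with both S_ISVTX and S_IWOTH set.

    Hypothetical helper: it inspects only the mode bits of the directory,
    not who owns it.
    """
    st = os.stat(path)
    both = stat.S_ISVTX | stat.S_IWOTH
    return stat.S_ISDIR(st.st_mode) and (st.st_mode & both) == both


if __name__ == "__main__":
    import sys

    # Pass the directories to check on the command line, e.g. dev/
    for path in sys.argv[1:]:
        print(path, is_sticky_world_writable(path))
```

Run against the dev/ mount point from the question, this should report True for the default tmpfs mode 01777 and False after mounting with mode=0755.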


This happens because of the tmpfs mount inside the user namespace: the kernel sets the owner of the root dir of the tmpfs filesystem from the untranslated credentials of the process doing the mount (hence the uid=1000,gid=1000 in the mount(1) output), and sets its permissions to the default 01777 (rwx for everybody + the sticky bit).

The combination of a non-root owner, write permissions for others (S_IWOTH) and a sticky bit (S_ISVTX) on a directory triggers a curious behaviour in newer Linux kernels: trying to open(2) an existing character device inside such a directory will fail with EACCES if the O_CREAT flag is used, even if done by the owner of the directory:

$ mkdir doo; chmod 1777 doo 
$ su -c 'mknod -m 666 doo/null c 1 3'  # or touch doo/null; mount -B /dev/null doo/null 
$ echo > doo/null
bash: doo/null: Permission denied

$ perl -MFcntl -e 'sysopen(NULL, "doo/null", O_WRONLY|O_CREAT, 0666) or die $!'
Permission denied at -e line 1.
$ perl -MFcntl -e 'sysopen(NULL, "doo/null", O_WRONLY, 0666) or die $!'
$ # ok!

$ chmod -t doo
$ echo > doo/null
$ # ok!

This is reproducible on Debian Buster / 4.19.0-5 and on more recent kernels, but not on Debian Stretch / 4.9.0-4.
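The two Perl one-liners can be mirrored in Python if that is more convenient for scripting the test. This is only a sketch: try_open is a hypothetical helper, and the doo/null path assumes the directory and device node were created as in the transcript above.

```python
import errno
import os


def try_open(path, flags, mode=0o666):
    """Attempt os.open() and return 'ok' or the symbolic errno name."""
    try:
        fd = os.open(path, flags, mode)
    except OSError as exc:
        # Map the numeric errno to its name, e.g. 13 -> 'EACCES' on Linux.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "ok"


if __name__ == "__main__":
    path = "doo/null"  # assumed to exist, as set up in the transcript
    print("O_WRONLY|O_CREAT:", try_open(path, os.O_WRONLY | os.O_CREAT))
    print("O_WRONLY        :", try_open(path, os.O_WRONLY))
```

On an affected kernel the first call is expected to report EACCES and the second ok, matching the Perl results.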

The behavior was introduced as a side effect of this commit:

commit 30aba6656f61ed44cba445a3c0d38b296fa9e8f5
Author: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Date:   Thu Aug 23 17:00:35 2018 -0700

    namei: allow restricted O_CREAT of FIFOs and regular files
...
diff --git a/fs/namei.c b/fs/namei.c
...
+static int may_create_in_sticky(struct dentry * const dir,
+                               struct inode * const inode)
+{
+       if ((!sysctl_protected_fifos && S_ISFIFO(inode->i_mode)) ||
+           (!sysctl_protected_regular && S_ISREG(inode->i_mode)) ||
+           likely(!(dir->d_inode->i_mode & S_ISVTX)) ||
+           uid_eq(inode->i_uid, dir->d_inode->i_uid) ||
+           uid_eq(current_fsuid(), inode->i_uid))
+               return 0;
+
+       if (likely(dir->d_inode->i_mode & 0002) ||
+           (dir->d_inode->i_mode & 0020 &&
+            ((sysctl_protected_fifos >= 2 && S_ISFIFO(inode->i_mode)) ||
+             (sysctl_protected_regular >= 2 && S_ISREG(inode->i_mode))))) {
+               return -EACCES;
+       }
+       return 0;
+}

doo/null does not satisfy the first if (which would return 0 early): it is neither a FIFO nor a regular file, the containing dir does have its sticky bit set, and the device's owner differs both from the owner of the dir and from the fsuid of the process trying to open it.

It will immediately match the second if because the containing dir has the 002 (write-for-others) bit set.
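The walk through the two conditions can be replayed in user space. The following Python function is a hedged transcription of the logic in the quoted diff, not kernel code: plain integers stand in for kuids, and the two protected_* parameters stand in for the sysctls (their default of 0 here is an assumption for illustration, and they do not matter for character devices).

```python
import errno
import stat


def may_create_in_sticky(dir_mode, dir_uid, inode_mode, inode_uid, fsuid,
                         protected_fifos=0, protected_regular=0):
    """User-space model of the quoted kernel check; returns 0 or -EACCES."""
    # First if: any one of these lets the open proceed.
    if ((not protected_fifos and stat.S_ISFIFO(inode_mode)) or
            (not protected_regular and stat.S_ISREG(inode_mode)) or
            not (dir_mode & stat.S_ISVTX) or
            inode_uid == dir_uid or
            fsuid == inode_uid):
        return 0
    # Second if: world-writable dir, or group-writable dir with sysctl >= 2.
    if (dir_mode & 0o002) or (
            (dir_mode & 0o020) and
            ((protected_fifos >= 2 and stat.S_ISFIFO(inode_mode)) or
             (protected_regular >= 2 and stat.S_ISREG(inode_mode)))):
        return -errno.EACCES
    return 0


if __name__ == "__main__":
    chardev = stat.S_IFCHR | 0o666  # like doo/null, owned by root (uid 0)
    for label, dir_mode in (("sticky 01777", 0o1777),
                            ("after chmod -t", 0o777),
                            ("mode=0755", 0o755)):
        rc = may_create_in_sticky(stat.S_IFDIR | dir_mode, 1000,
                                  chardev, 0, 1000)
        print(label, "->", "EACCES" if rc else "ok")
```

With a directory owned by uid 1000, a device owned by root, and uid 1000 doing the open, only the 01777 case is denied; dropping the sticky bit or using mode 0755 returns 0 at the first if.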

I couldn't tell from the LKML discussions where this was introduced (1, 2, 3) whether this effect was even considered. Anyway, putting character devices inside world-writable sticky directories is unlikely to be intentional.