
For some time I've been experiencing a strange issue with NFS where a seemingly random subset of directories (always the same ones) under / consistently show up with stale file handles immediately after NFS mount.

I've been able to correct the problem by explicitly exporting the seemingly-random set of problem directories, but I'd like to see if I can fix things more completely so I don't have to occasionally add random directories to the export table.
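For illustration, the workaround amounts to adding explicit per-directory lines to /etc/exports on the server. The directory names below are examples only (the real set varies), and the option set simply mirrors my existing root export, shown further down:

```
# Illustrative /etc/exports workaround -- explicit entries for the problem dirs.
/      *(rw,subtree_check,no_root_squash,nohide,crossmnt,fsid=0,sync)
/home  *(rw,subtree_check,no_root_squash,sync)
/usr   *(rw,subtree_check,no_root_squash,sync)
```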

Below, I mount the filesystem, show with lsof that nothing under it is held open, run ls, and then rerun lsof. Empty lines are added between commands for clarity:

# mount -t nfs -o vers=4,noac,hard,intr 192.168.0.2:/ /nfs -vvv
mount.nfs: trying text-based options 'vers=4,noac,hard,intr,addr=192.168.0.2,clientaddr=192.168.0.4'
192.168.0.2:/ on /nfs type nfs (rw,vers=4,noac,hard,intr)

# lsof | grep /nfs

# ls -lh /nfs

ls: cannot access /nfs/usr: Stale file handle
ls: cannot access /nfs/root: Stale file handle
ls: cannot access /nfs/etc: Stale file handle
ls: cannot access /nfs/home: Stale file handle
lrwxrwxrwx   1 root root    7 Mar 27  2017 bin -> usr/bin
drwxr-xr-x   6 root root  16K Jan  1  1970 boot
drwxr-xr-x 438 i336 users 36K Feb 28 12:12 data
drwxr-xr-x   2 root root 4.0K Mar 14  2016 dev
d?????????   ? ?    ?        ?            ? etc
d?????????   ? ?    ?        ?            ? home
lrwxrwxrwx   1 root root    7 Mar 27  2017 lib -> usr/lib
lrwxrwxrwx   1 root root    7 Mar 27  2017 lib64 -> usr/lib
drwxr-xr-x  15 root root 4.0K Oct 15 15:51 mnt
drwxr-xr-x   2 root root 4.0K Aug  9  2017 nfs
drwxr-xr-x  14 root root 4.0K Jan 28 17:00 opt
dr-xr-xr-x   2 root root 4.0K Mar 14  2016 proc
d?????????   ? ?    ?        ?            ? root
drwxr-xr-x   2 root root 4.0K Mar 14  2016 run
lrwxrwxrwx   1 root root    7 Mar 27  2017 sbin -> usr/bin
drwxr-xr-x   6 root root 4.0K Jun 22  2016 srv
dr-xr-xr-x   2 root root 4.0K Mar 14  2016 sys
drwxrwxrwt   2 root root 4.0K Dec 10  2016 tmp
d?????????   ? ?    ?        ?            ? usr
drwxr-xr-x  15 root root 4.0K May 24  2017 var

# lsof | grep /nfs

The subdirectories in question are not mount points; they seem completely normal:

$ ls -dlh /usr /root /etc /home
drwxr-xr-x 123 root root  12K Mar  3 13:34 /etc
drwxr-xr-x   7 root root 4.0K Jul 28  2017 /home
drwxrwxrwx  32 root root 4.0K Mar  3 13:55 /root
drwxr-xr-x  15 root root 4.0K Feb 24 17:48 /usr
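A quick way to double-check this on the server is a sketch like the following (mountpoint is part of util-linux; findmnt --target would work equally well):

```shell
# Confirm none of the problem directories is a separate mount point on the
# server; mountpoint(1) exits 0 only for an actual mount point.
for d in /usr /root /etc /home; do
    if mountpoint -q "$d"; then
        echo "$d: mount point"
    else
        echo "$d: plain directory on the root filesystem"
    fi
done
```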

There are no related errors in syslog about these directories; the only messages that do show up mention a different set of directories:

... rpc.mountd[10080]: Cannot export /proc, possibly unsupported filesystem or fsid= required
... rpc.mountd[10080]: Cannot export /dev, possibly unsupported filesystem or fsid= required
... rpc.mountd[10080]: Cannot export /sys, possibly unsupported filesystem or fsid= required
... rpc.mountd[10080]: Cannot export /tmp, possibly unsupported filesystem or fsid= required
... rpc.mountd[10080]: Cannot export /run, possibly unsupported filesystem or fsid= required

Here's what /etc/exports currently looks like:

/ *(rw,subtree_check,no_root_squash,nohide,crossmnt,fsid=0,sync)

The server side is running Arch Linux and is currently on kernel 4.10.3.

The client side is Slackware 14.1 with kernel 4.1.6.

  • Regarding the bounty (sorry it's not a lot ) - I'm looking for help/ideas on what I might be able to try and poke in order to figure out what's going on here. My own efforts fairly promptly hit a brick wall when I discovered nfsd couldn't be straced as it's a kernel process. Maybe there's something I can do with BPF or similar...? Or perhaps Wireshark might come in handy? – i336_ Mar 06 '18 at 02:03
  • Experienced a slightly similar issue this week, which seemed to be caused by the NFS server; rebooting the NFS server finally resolved it. Do the logs on Arch Linux (4.10.3) show anything suspicious? – U880D Mar 06 '18 at 16:39
  • I would consider reviewing how the / directory is exported on the NFS server. Moreover, if there is just a few directories you are trying to access, consider mounting them separately. Alternatively, if you need to access files under /home, for instance, try exporting and mounting that as an experiment. I should say, I have no idea what the issue is. However, just "messing around" can sometimes yield useful insights. Of course, you can try these things I've suggested and come up with nothing anyway. – Mark Mar 06 '18 at 23:16
  • Did you read this: https://serverfault.com/questions/617610/stale-nfs-file-handle-after-reboot ? – thomas Mar 07 '18 at 15:12
  • Did you reproduce the problem from another client? Can you try with another kernel? – Patrick Mevzek Mar 08 '18 at 00:48
  • This doesn't solve everything but proved tangentially noteworthy: some other (unlisted) paths that went stale pointed to discrete mountpoints. If I forget to mount these before starting NFS (I do everything manually), no amount of un/remounting will make the paths un-stale - I have to kill nfsd, mount everything, then start nfsd again. So this at least restores access to my USB HDD, but doesn't explain why /home remains stale. – i336_ Mar 09 '18 at 01:31
  • @PatrickMevzek: I've long wondered (somewhat cynically, heh) whether trying a different kernel version would fix things. I suspect it just might. It would be a last resort though; that's a level of fussing around/maintenance I admittedly hate. – i336_ Mar 09 '18 at 01:34
  • @Mark: I actually had separate listings for /home, /etc, etc in my exports file for months. A few days ago everything fell apart and I decided "alright let's see if we can fix this for good." I suspect the "fix" will be switching to SMB, heh... – i336_ Mar 09 '18 at 01:38
  • @i336_ I agree that doing so (testing another kernel -- note that I suggested another client also, since in your example the delta between both kernels is not small) does not look sexy, but it depends on your goal: either to get the real final word on pinpointing precisely what created the problem (a research task that may not be bounded by time) or to find the quickest solution to the problem (a pragmatic engineering task) in order to go forward with other stuff. Both are sensible but will not go the same route nor take the same amount of time and energy. – Patrick Mevzek Mar 09 '18 at 01:45
  • Perhaps the NFS server is started to early during boot so that the export runs in parallel to other mounts? Also, have you tried bind-mounting your / export to /srv/nfs4/root (or whatever) and exporting from there? This is often done (though for other reasons) and may possibly help here. – Ned64 May 18 '19 at 10:28
  • The problem still seems to persist. After a crash of my secondary server (its rootfs is mounted on the primary via NFS) and a mandatory installation from scratch exporting the secondary's / directory with crossmnt had triggered these ominous stale file handle warnings when attempting to access the /boot or /srv directories (both are mount points), and nothing that I have tried would solve the problem. I had to work around this by exporting those mounts with nohide set. – Robidu Mar 19 '24 at 23:11

1 Answer


Your exports file looks wrong for NFSv4:

/ *(rw,subtree_check,no_root_squash,nohide,crossmnt,fsid=0,sync)

Instead, I think you need to follow the Arch Linux instructions for that fsid=0 line: it declares a special export, the so-called NFS root.

Then declare your own exports on subsequent lines, as the instructions show. You can export the server's root filesystem (not to be confused with the NFS root) as shown in this old Gentoo post.
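As a sketch, an Arch-style NFSv4 layout might look like the following. The /srv/nfs path and the rootfs name are placeholders I've chosen for illustration, not paths from your setup:

```
# /etc/fstab on the server -- bind-mount the real root under the pseudo-root:
#   /  /srv/nfs/rootfs  none  bind  0 0
#
# /etc/exports -- fsid=0 marks /srv/nfs as the NFSv4 pseudo-root:
/srv/nfs         *(rw,no_subtree_check)
/srv/nfs/rootfs  *(rw,no_subtree_check,no_root_squash,sync)
```

With NFSv4, clients then address paths relative to the pseudo-root, e.g. `mount -t nfs -o vers=4 192.168.0.2:/rootfs /nfs`, rather than mounting the server's real / directly.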