Preventing broken NFS connection from freezing the client system

Question

We have an NFS 4 share that shares a volume between a number of servers (the NFS server and the clients all run Debian 8). We have had some issues recently where network outages would freeze the client systems.

Our NFS options were minimal, just rw (and so the defaults hard, fg, etc).

I'm now experimenting with these options, but am not getting the behaviour I expect: rw,soft,bg,retrans=6,timeo=150

(I've increased the retrans to offset some of the soft risk)

The procedure I'm following to test is:

Boot machine
cd to /mnt/mountpoint
Verify NFS connection ok
cd /
kill network: ifdown eth0
cd to /mnt/mountpoint
ls

At this point the command line freezes, and I can't interrupt it. After some time the message 'nfs: server [servername] not responding, timed out' appears, and it seems to repeat once a minute (indefinitely).

What I would like/expect is for the operation to fail and return control.

Please could someone tell me where I'm going wrong with these settings?

(PS: I also tried mounting with autofs, but saw similar behaviour)

Thank you

I wouldn't recommend soft under any circumstances. It allows data to be discarded on error. Instead I'd suggest hard,intr. — Chris Davies, Mar 02 '16 at 17:59
@roaima - Thanks. That opinion seems to be very prevalent on the web :) Trouble is the current situation we have with hard is just as bad for us (systems dying and staying dead until rebooted). intr is not supported in NFS4 according to man. — UpTheCreek, Mar 02 '16 at 18:01
(Correction, it seems intr is supported by NFS4, but not by kernels > 2.6.25) — UpTheCreek, Mar 02 '16 at 19:16
I think that the thing that differs from the 'standard' answers is that you're changing current working directory to the mount point. Do you get the same behaviour without the cd, but instead doing ls /mnt/mountpoint? It's possible that after the ls fails, your shell is attempting filesystem operations dependent on PWD. (Even worse, if you were foolish enough to put . in your $PATH) — Toby Speight, Mar 14 '16 at 10:03

BowlOfRed · Answer 1 · 2016-03-02T18:41:19.257

intr should allow you to regain control when you hit ^C, but usually not immediately.

   intr           If an NFS file operation has a major timeout and it is hard mounted, then allow signals to interrupt the
                  file  operation  and cause it to return EINTR to the calling program.  The default is to not allow file
                  operations to be interrupted.

As you say, expectations are the problem here. Network problems can be temporary, but failing an operation is permanent. So most operations default to simply blocking until the operation completes.
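To put rough numbers on the options in the question: as I read nfs(5), timeo is given in tenths of a second, and over TCP the client backs off linearly, adding timeo to the wait after each retransmission up to a maximum of 600 seconds. The sketch below just prints that schedule. The function name nfs_wait_schedule is made up for illustration, and the loop is my literal reading of the man page, not a description of what the kernel timers actually do.

```shell
#!/bin/sh
# nfs_wait_schedule TIMEO RETRANS
# Print the wait before each attempt times out, in seconds.
# ASSUMPTION: linear backoff as worded in nfs(5) for TCP; real kernel
# timers may differ.
nfs_wait_schedule() {
    timeo=$1       # tenths of a second, same unit as the mount option
    retrans=$2     # number of retransmissions after the initial request
    cap=6000       # 600 seconds (the maximum named in nfs(5)), in tenths
    n=0            # attempt 0 is the initial request
    while [ "$n" -le "$retrans" ]; do
        wait_ds=$(( timeo * (n + 1) ))
        if [ "$wait_ds" -gt "$cap" ]; then
            wait_ds=$cap
        fi
        echo "attempt $n: wait $(( wait_ds / 10 ))s"
        n=$(( n + 1 ))
    done
}

# The options from the question: timeo=150, retrans=6
nfs_wait_schedule 150 6
```

Under that reading, even a soft mount with these options spends a good while waiting before it gives up, so a slow failure is not by itself a sign that soft is being ignored.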

This is the standard answer, but looking at a current man page I see this:

                  The intr / nointr mount option is deprecated after kernel
                  2.6.25.  Only SIGKILL can interrupt a pending NFS operation
                  on these kernels, and if specified, this mount option is
                  ignored to provide backwards compatibility with older
                  kernels.

So it doesn't appear to me to be an NFS3/NFS4 issue, but a decision about how intr works. You should still be able to KILL the process, but that may not be of much use to you.

I was unable to find the discussion about why the option was removed. Can you kill -KILL your process?
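If SIGKILL really is the only way out, one option is to touch the mount from a disposable child process instead of from your interactive shell. Here is a minimal sketch, assuming GNU coreutils timeout is installed. probe_mount is a name I made up, and it only helps if the kernel actually lets SIGKILL interrupt the stuck operation, which I have not verified on Debian 8. If the child cannot be killed, timeout itself will keep waiting for it.

```shell
#!/bin/sh
# probe_mount PATH [SECONDS]
# Succeed only if PATH can be listed within the time limit (default 5s).
# probe_mount is a hypothetical helper, not part of any NFS tooling.
probe_mount() {
    path=$1
    limit=${2:-5}
    # List the directory in a child process so the calling shell never
    # blocks on the mount itself. SIGKILL is used because, per the man
    # page quoted above, it is the only signal that can interrupt a
    # pending NFS operation on kernels after 2.6.25.
    if timeout -s KILL "$limit" ls -- "$path" >/dev/null 2>&1; then
        echo "ok: $path"
    else
        echo "unavailable: $path"
        return 1
    fi
}
```

You would call it as probe_mount /mnt/mountpoint 10 before doing anything that depends on the share, and note that it deliberately avoids cd into the mount point.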

Thanks, but according to man intr is supported by nfs 2/3 but not 4. — UpTheCreek, Mar 02 '16 at 18:00
@UpTheCreek, I don't understand why that would be. I don't have a debian system here but it's explicitly mentioned as available. Have you tried it? "intr This will allow NFS4 operations (on hard mounts) to be interrupted while waiting for a response from the server." — BowlOfRed, Mar 02 '16 at 18:12
Yeah, I tried it, and it didn't seem to have any effect. Man says it's ignored in recent kernel versions. — UpTheCreek, Mar 02 '16 at 18:25
It is not possible to KILL a process because the entire system freezes. No commands can be issued in my experience. (Although it might be possible to SSH into such a frozen machine in some cases.) — MountainX, Aug 26 '17 at 02:28

Chris Davies · Answer 2 · 2017-08-11T10:40:54.260

Some of my answer is opinion, based on experience. Where I have facts I will (try to remember to) link to them.

NFS 4 is considered an improvement over versions 2 and 3. However, I've not yet seen a strong use case for needing the improvements. Perhaps that's because I aim to export filesystems to Windows clients with Samba and to Unix/Linux clients with NFS.
I wouldn't recommend soft under almost any circumstances. It allows data to be discarded on error. Instead I'd suggest hard,intr.
As you point out, intr has no effect on your NFS 4 mount, but it seems that this is a kernel change (the option is ignored after kernel 2.6.25) rather than an NFS one.
The NFS Automounter (autofs) works well for my use cases with NFS versions 2 and 3, and manages to help protect my client systems from server failure by mounting the NFS filesystems only while they are required.

My suggestion to you would be to consider moving from NFS 4 to NFS 3 and see if that helps your particular use case. Don't think of it as a downgrade.

edited Aug 11 '17 at 10:40

answered Mar 02 '16 at 18:14

Chris Davies


Thanks, but I'm not able to switch to NFS3, and even if I were, as you say, intr is not supported on recent kernel versions. – UpTheCreek Mar 02 '16 at 18:36
Ah yeah, looks like intr is supported in NFS4 (it's listed in both the 2/3 only options and the 4 only options in man, which is a bit confusing), but just not supported in recent kernel versions. – UpTheCreek Mar 02 '16 at 19:17
"I wouldn't recommend soft under any circumstances" - really? In my case, I have a busy web server that mounts an images directory. If the images host goes down and we use hard, the whole website goes down. If we use soft, we might possibly get a few broken images (though our caching system mitigates this almost completely). The risk of soft allowing file corruption really isn't that much of a big deal. I'd much rather have one corrupt image file than a site down! – Doug McLean Aug 11 '17 at 08:53
@DougMcLean having also been in a similar situation (busy web farm, image servers, NFS...) I would say it's a somewhat specialised case. If my image servers had been as unreliable I suspect I might well have settled for soft as an acceptable solution. Answer modified from "never" to "almost never". Thanks! – Chris Davies Aug 11 '17 at 10:48
If my memory is correct, this system freeze problem was also present in NFS v3. – MountainX Aug 26 '17 at 02:30
