
I have a medium-to-large server that all of my engineering users are on. This is a 32-core, 256 GB system hosting 19 Xvnc sessions, plus the plethora of tools, login sessions, and other processes that such a user base implies. All users are configured via NIS and have home directories on NFS. Various automated processes also use NIS-defined users and NFS-mounted file systems.

The computer in question is running CentOS 6.5, and the file server involved is a NetApp.

Occasionally, after the computer has been running a while, people get intermittent issues with deleting something; the error is along the lines of "device/resource busy". lsof reveals nothing holding onto the item (file or directory) in question; after some indeterminate amount of time (usually less than is required to find an admin and have him look at the issue) the problem goes away and the item can be deleted.
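
For reference, the obvious checks look like this (the path is just one example from an affected tree); lsof consistently shows nothing, and fuser is listed only as an additional thing worth trying:

# look for anything holding the directory, or anything under it, open
lsof +D /home/local-user/tartarus/project8/doc/verif/verification_environment/learning
# list any processes using the filesystem that contains the path
fuser -vm /home/local-user/tartarus/project8/doc/verif/verification_environment/learning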

Around the same time, one of the automated processes that uses SVN gets errors like this:

svn: E155009: Failed to run the WC DB work queue associated with '/home/local-user/tartarus/project8/doc/verif/verification_environment/learning/images', work item 930 (file-install doc/verif/verification_environment/learning/images/my-sequence.uml 1 0 1 1)
svn: E000018: Can't move '/home/local-user/tartarus/project8/.svn/tmp/svn-j3XrNq' to '/home/local-user/tartarus/project8/doc/verif/verification_environment/learning/images/my-sequence.uml': Invalid cross-device link

If we try to remove the offending file, we get:

rm: cannot remove `project8/doc/verif/verification_environment/learning': Device or resource busy

Googling "Invalid cross-device link" leads to a lot of discussion about svn versions not supporting writes across different devices, which is irrelevant to us: this generally works, we're not running cross-version svn repositories, and we're not running cross-device repositories either, since the .svn directory is on the same (NFS-mounted) device as the working copy.

Rebooting the computer makes the problem go away for weeks or months -- in my current case, the computer has just passed 185 days of uptime. But the engineers are not keen to reboot their world any more frequently than necessary.

We have excluded the file server as a cause, because other computers only show the issue while it is manifesting on the main system. That is -- if the main system can't move/rename a file, we can duplicate that failure from other computers, but other computers never independently manifest this behaviour.

The mount options for the NFS file systems are: rw,intr,sloppy,addr=10.17.0.199
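
(If it helps, the options the client actually negotiated -- as opposed to what was asked for -- can be double-checked with either of these; nothing below is specific to our setup:)

# show the options the kernel is actually using for each NFS mount
nfsstat -m
# or read them straight from the kernel mount table
grep nfs /proc/mounts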

I figure this has to be a kernel value somewhere that is getting over-filled, either as a side-effect of some leaky thing the engineers are running or as a burst due to a temporary load.

It isn't total files-being-opened, because that limit is like 25M files and this computer is peaking under 200K files.
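
(For anyone who wants to check the same thing on their own box, the relevant counters are all in proc; nothing here is specific to our setup:)

# system-wide: allocated handles, unused handles, and the maximum
cat /proc/sys/fs/file-nr
cat /proc/sys/fs/file-max
# per-process limit in the current shell
ulimit -n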

Does anyone know what I should be looking at/for?


2 Answers


Short Answer: Your local NFS client doesn't think the file or directory is there. (Yeah, you kinda suspected that.)

NFS is an old technology. It wasn't meant for high-access, rapidly changing files. For dynamic shared-file systems, try a clustered solution like OCFS2 (my fave) or gluster (meh, Dark Side).

Several years ago we had 4 servers mounting a common NFS share and found repeatedly that one server would create a file the other servers couldn't see. The 4 servers were web app servers. A user would initiate an action that would have a server create a package and, when complete, update a row in the database with the NFS path to the file. The user's browser would check back every 10 seconds to see if the job was done, and if it was, the file would download. You can see the problem coming - the server would update the row in the DB to say the file was there, but another server would get the next request from the user's browser, go to read the file, and get "file not found" errors.

As you said, by the time an admin looked at it, the file was there. It took several of us engineers weeks of digging to find the problem. Basically we ran a 10-second sleep loop that would get the last created file path as recorded in the database and try to ls the file into a log. Consistently, the system that created the file could see it but the others couldn't for some period of time, and that gap grew as the load on the servers increased.
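
From memory the loop was roughly the following -- the database, table, and column names here are made up for illustration:

# every 10 seconds, grab the newest package path recorded in the DB
# and log whether this particular host can see the file yet
while true; do
    f=$(mysql -N -e "SELECT path FROM packages ORDER BY id DESC LIMIT 1" appdb)
    date >> /var/log/nfs-visibility.log
    ls -l "$f" >> /var/log/nfs-visibility.log 2>&1
    sleep 10
done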

The pointy-haired bosses didn't want to change the underlying NFS to a clustered file system, so instead we also had the server that did the work record in the DB that it was the one that created the file. The user's request would keep retrying until the job was done and the request landed on the server that created the file, so the file would always be there to read. Yeah, I know. Kludge. But that's what you get when you decide to keep old technology - you have to kludge to get things to work. The old tech is the first kludge and everything done to work with it is just more kludge. Welcome back to the 80s and Max Headroom's FS of choice.

NFS doesn't keep all the clients in sync with all the changes in real time. So you'll constantly run into conditions where one client creates a file/directory and another one can't see it, or where one client deletes a file/directory and other clients think it's still there (until they try to use it - oops).

We tried all kinds of tricks to get the systems to resync their client cache before trying to read the file. Not happening.
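
(The obvious candidates are the standard attribute-cache mount options; they're shown here only for completeness, with a placeholder server and mount point, because in our case nothing along these lines closed the window:)

# /etc/fstab: disable the attribute cache entirely -- every getattr goes back to the server
filer:/vol/shared  /mnt/shared  nfs  rw,noac       0 0
# or keep the cache but cut its lifetime down to one second
filer:/vol/shared  /mnt/shared  nfs  rw,actimeo=1  0 0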

My recommendation: bring your FS into this century. (try the flux capacitor @ 88mph)

Dev Null
  • I've seen the nfs problem you describe. The problem is that the nfs client is caching negative lookup results -- ie when it goes to open a file, the file isn't there, and it caches that answer. You're stuck with that answer until the cache times out. I've solved it by mounting nfs with the option lookupcache=positive set. In my case this activity is all happening on the same machine, so it isn't a caching problem. – David Mackintosh Aug 12 '15 at 13:35
  • try lookupcache=none. using lookupcache=positive means it won't invalidate the cache until the parent dir cache expires. lookupcache=none is really bad since now there is no local cache for any operation and it will really wear the nfs server out, but I don't think you have any other option since you're using NFS. – Dev Null Aug 13 '15 at 15:09
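
(For reference, the lookupcache options discussed in the comments above are ordinary nfs mount options; a minimal example, with a placeholder server and export:)

# cache only successful lookups; negative (ENOENT) results are not cached
mount -t nfs -o rw,lookupcache=positive filer:/vol/home /home
# disable the lookup cache completely (much heavier load on the NFS server)
mount -t nfs -o rw,lookupcache=none filer:/vol/home /home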

Just a remark:

The same E155009/E000018 errors also occur with purely local cross-device moves:

svn: E155009: Failed to run the WC DB work queue associated with '/first-device/mounted-device', work item 1219 (file-install /first-device/mounted-device/file-to-move-to 1 0 1 1)
svn: E000018: Can't move '/first-device/.svn/tmp/svn-v2KRIt' to '/first-device/mounted-device/file-to-move-to': Invalid cross-device link

As such, it is not NFS specific.
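
You can reproduce the errno with nothing more than two local file systems. Using the same example layout as above (a second device mounted under /first-device), for illustration:

# a hard link cannot span filesystems, and rename(2) fails the same way
touch /first-device/somefile
ln /first-device/somefile /first-device/mounted-device/somefile
# ln: failed to create hard link ... Invalid cross-device link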