I have a medium-to-large server that all of my engineering users are on. This is a 32-core, 256GB system hosting 19 Xvnc sessions plus the plethora of tools, login sessions, and such that is implied by such a userbase. All users are configured via NIS and have home directories on NFS. Various automated processes are also using NIS-defined users and NFS-mounted file systems.
The computer in question is running CentOS 6.5, and the file server involved is a NetApp.
Occasionally, after the computer has been running for a while, people hit intermittent failures when deleting something; the error is along the lines of "device/resource busy". lsof reveals nothing holding onto the item (file or directory) in question; after some indeterminate amount of time (usually less than it takes to find an admin and have them look at the issue) the problem goes away and the item can be deleted.
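One thing I've been ruling out when lsof comes up empty is NFS "silly renaming": when a client deletes a file that some process still has open, the NFS client renames it to a hidden .nfsXXXX file, and that remnant keeps the parent directory busy until the handle is closed -- possibly by a process on a *different* client, which this host's lsof can't see. A quick check (the path is just an example -- substitute whatever directory reports busy):

```shell
# Example path -- substitute the directory that reports "busy".
DIR=${DIR:-/home/local-user/tartarus/project8/doc/verif/verification_environment/learning}

# Silly-renamed remnants of deleted-but-still-open files; these keep the
# parent directory busy even when lsof on this host shows nothing (the
# open handle may belong to a process on another NFS client).
ls -la "$DIR"/.nfs* 2>/dev/null || echo "no silly-renamed files found"
```

If a .nfs file does show up, the holder has to be hunted down on whichever client has it open.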
Around the same time, one of the automated processes that uses SVN gets errors like this:
svn: E155009: Failed to run the WC DB work queue associated with '/home/local-user/tartarus/project8/doc/verif/verification_environment/learning/images', work item 930 (file-install doc/verif/verification_environment/learning/images/my-sequence.uml 1 0 1 1)
svn: E000018: Can't move '/home/local-user/tartarus/project8/.svn/tmp/svn-j3XrNq' to '/home/local-user/tartarus/project8/doc/verif/verification_environment/learning/images/my-sequence.uml': Invalid cross-device link
If we try to remove the offending file, we get:
rm: cannot remove `project8/doc/verif/verification_environment/learning': Device or resource busy
Googling "Invalid cross-device link" leads to a lot of discussion about SVN versions not supporting writes across different devices, which is irrelevant to us: this generally works, we're not running cross-version SVN repositories, and we're not running cross-device repositories either, since the .svn directory is on the same (NFS-mounted) device as the working copy.
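For what it's worth, "Invalid cross-device link" is just the strerror() text for EXDEV, the errno that rename(2) and link(2) return when the two paths resolve to different mounted filesystems. SVN calls rename() directly (unlike mv, it has no copy-and-unlink fallback), so if the client ever transiently decides the .svn tmp area and the target live on different "devices", this is exactly the error you'd see. A deliberate reproduction, assuming /dev/shm is mounted as its own tmpfs:

```shell
# Hard-linking (or renaming) across two different mounts fails with EXDEV,
# even when the underlying filesystem type is the same -- it's the mount
# boundary, not the hardware, that matters.
touch /tmp/exdev-demo
ln /tmp/exdev-demo /dev/shm/exdev-demo
# ln: failed to create hard link ... Invalid cross-device link
rm -f /tmp/exdev-demo /dev/shm/exdev-demo
```

So the question becomes why the kernel briefly treats two paths on the same NFS mount as belonging to different devices.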
Rebooting the computer makes the problem go away for weeks or months -- in my current case, the computer has just passed 185 days of uptime. But the engineers are not keen to reboot their world any more frequently than necessary.
We have excluded the file server as a cause: other computers do not have the same issue unless it is already manifesting on the main system. That is, we can duplicate the fact that files can't be moved/renamed from other machines while the main system can't move/rename them, but other computers never independently manifest this behaviour.
The mount options for the NFS file systems are: rw,intr,sloppy,addr=10.17.0.199
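Note that `sloppy` tells mount to silently ignore any option it doesn't recognize, so the options requested in fstab aren't necessarily what the kernel negotiated. The options actually in effect (NFS version, timeo/retrans, hard vs soft, attribute-cache settings) can be seen on any Linux client with:

```shell
# Options the kernel actually negotiated, which may differ from fstab;
# CentOS 6 defaults to NFSv3 hard mounts if nothing else is specified.
grep -w nfs /proc/mounts

# nfsstat -m (from nfs-utils) prints the same per-mount detail more readably.
nfsstat -m 2>/dev/null || true
```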
I figure this has to be a kernel value somewhere that is getting over-filled, either as a side-effect of some leaky thing the engineers are running or as a burst due to a temporary load.
It isn't total files-being-opened, because that limit is like 25M files and this computer is peaking under 200K files.
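The counters I'm watching are the standard Linux /proc interfaces; alongside the file-handle count, the inode and dentry caches seem worth tracking, since a pinned dentry is one plausible way a directory could look busy with no open file handles. (This is a guess about the mechanism, not something I've confirmed.)

```shell
# fs.file-nr: allocated handles, free handles, system-wide maximum
read -r allocated free max < /proc/sys/fs/file-nr
echo "open file handles: $allocated of $max"

# In-core inodes and dentry-cache state -- growth that never subsides
# across weeks of uptime would support the "something is leaking" theory.
cat /proc/sys/fs/inode-nr
cat /proc/sys/fs/dentry-state
```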
Does anyone know what I should be looking at/for?
dmesg show a lot of timeouts/reconnects? – Bratchley May 24 '15 at 00:20
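Checking as the comment suggests -- on a CentOS 6 client, NFS server timeouts and recoveries show up in the kernel ring buffer as "nfs: server X not responding" / "nfs: server X OK" pairs, so a grep like this should surface them:

```shell
# Timeout/reconnect chatter from the NFS client, if any
dmesg | grep -iE 'nfs|not responding|timed out|lockd|stale' | tail -n 20
```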