Tracking huge buffer usage under Linux

Question

I have a Linux server running Debian Lenny with 4 GB of RAM. It doesn't run much, just:

Postfix/spamassassin (daemon mode)
Bind9
KVM (one guest with 1 GB of RAM)

Every day at exactly 3:05 UTC, the server loses almost all of its free memory. After that, more than 2 GB is held in buffers and is never released (unless I manually tell the kernel to drop the cache).

I've searched the web a lot and, at the beginning, I thought this was due to NFS buffer usage. I do backups to an NFS share using gzip/tar, and the backup ran at 3:05.

However, I'm now in a very strange situation: I moved the backup task to 1:40 (it completes in 2 minutes) and the free RAM still drops at 3:05.

There is nothing unusual in my logs, except that at 03:05:01 cron opens a session as root and closes it at 03:05:02 without apparently doing anything. Of course, cron has been restarted and I checked the timing of the tasks; again, nothing unusual.

Any idea why this happens? Or, any idea about how to track what's using all those buffers?

Edit: The server is running in UTC, and all times here are UTC. It's not running any NFS server and I don't have mlocate or slocate installed. Nothing in /etc/crontab, the daily cron jobs, or the user crontabs is scheduled at that hour.

Here's the interesting part in my logs about cron:

auth.log-20110501:May 1 03:05:01 SRV CRON[15914]: pam_unix(cron:session): session opened for user root by (uid=0)
auth.log-20110501:May 1 03:05:01 SRV CRON[15914]: pam_unix(cron:session): session closed for user root
syslog-20110501:May 1 03:05:01 SRV /USR/SBIN/CRON[15915]: (root) CMD ([ -x /usr/lib/sysstat/sa1 ] && { [ -r "$DEFAULT" ] && . "$DEFAULT" ; [ "$ENABLED" = "true" ] && exec /usr/lib/sysstat/sa1 $SA1_OPTIONS 1 1 ; })

Is UTC the server timezone? Are you sure there's no cron jobs run at that time? e.g. makewhatis/man-db and slocate/mlocate do a lot of disk access. — Mikel, May 01 '11 at 02:48
And can you add the relevant part of your cron log to the question please? — Mikel, May 01 '11 at 02:58
Is this server also an NFS server? Maybe there is a backup running on a different host that is using NFS to talk to your server. — Mikel, May 01 '11 at 03:06

Mikel · Answer 1 · 2011-05-01T03:22:42.107

Is your server running in UTC? Most would run in a local time zone, which I'm guessing is UTC+1 (MET or CET) for you. I ask because we need to know what timezone is used in the crontab, e.g. maybe 3.05 isn't called 3.05 in the crontab.

Common tasks that do a lot of disk access include makewhatis/man-db and slocate/mlocate. Double check they're not being run around 3.05, e.g. look at /etc/crontab and /etc/cron.daily. Check user cron tabs in /var/spool/cron.
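As a rough sketch of that check, the helper below prints every active (non-comment, non-blank) line from a set of cron files, prefixed with the file name, so the schedules can be scanned in one pass. The function name `show_cron_entries` is made up for illustration. Including `/etc/cron.d` is an addition to the locations listed above, and `/var/spool/cron/crontabs` is where Debian keeps user crontabs; both paths are assumptions about this particular system.

```shell
#!/bin/sh
# Hypothetical helper: print the active lines of each cron file given
# as an argument, prefixed with the file name.
show_cron_entries() {
    for f in "$@"; do
        # Skip anything that is not a regular file (e.g. an unmatched glob).
        [ -f "$f" ] || continue
        # -H prints the file name; -v drops comment lines and blank lines.
        grep -H -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$f"
    done
}
```

Run as root, a call such as `show_cron_entries /etc/crontab /etc/cron.d/* /var/spool/cron/crontabs/*` would list the candidates. Scripts in /etc/cron.daily have no schedule of their own, so their start time comes from the run-parts line in /etc/crontab.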

There are two ways I can think of to figure out what's running at 3.05 without the cron logs.

The first is auditctl, which is a bit painful to use.

Based on the man page, I would try something like:

$ sudo auditctl -a entry,always -S open -S creat \
    -S read -S readv -S write -S writev -S sendfile \
    -S fork -S clone -S execve -k 305

to set up the auditing, then

$ sudo aureport -s -i -ts 03:04 -te 03:06

when you log in to the system after 3.05 to check what happened.

The second is a simple ps. Just write a script to run ps several times, and schedule it to run at 3.04.

Several ps output fields are useful here: wchan and stat show which processes are waiting on I/O, and pcpu shows which processes are using the most CPU at the time. Of course, any process that doesn't appear in the list at 3.04.59 but does appear at or shortly after 3.05.01 is an obvious suspect as well.
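The ps approach could look something like the sketch below. It is only one possible way to do it: the function name `sample_ps`, the choice of columns, and the log path in the usage note are all assumptions, not anything prescribed by ps or cron.

```shell
#!/bin/sh
# Hypothetical helper: append COUNT snapshots of the process table to
# OUTFILE, pausing INTERVAL seconds after each one.
sample_ps() {
    count=$1
    interval=$2
    outfile=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        # Timestamp header so consecutive snapshots can be compared later.
        echo "=== sample $i at $(date -u '+%Y-%m-%d %H:%M:%S') UTC ===" >> "$outfile"
        # stat and wchan hint at processes blocked on I/O; pcpu shows CPU use.
        ps -eo pid,ppid,user,stat,wchan,pcpu,args >> "$outfile" 2>&1
        i=$((i + 1))
        sleep "$interval"
    done
}
```

To cover the window from 3.04 to 3.06, a script containing this function could end with `sample_ps 60 2 /var/tmp/ps-0305.log` and be scheduled with a root crontab line such as `4 3 * * * /usr/local/sbin/ps-sampler.sh` (both paths are placeholders). Comparing the snapshots just before and just after 3.05.01 should show which processes appeared.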

I'll track this using the auditctl command as suggested — Henry-Nicolas Tourneur, May 01 '11 at 10:56
