
I am running Wind River Linux 4.3 on an embedded Freescale single board computer (SBC). Output from uname -a:

Linux ge101 2.6.34.10-myname-grsec-WR4.3.0.0_cgl #1 SMP PREEMPT Mon Aug 26 01:35:35 PDT 2013 ppc ppc ppc GNU/Linux

There are several modules and processes running and creating log files that I am interested in collecting:

  1. A kernel module logging to /var/log/messages
  2. A userspace application whose stdout is redirected to a log file. This application is started automatically at boot by an init.d script, which runs: nohup ${executable_name} -C ${configfile_name} >> ${logfile} 2>&1 &
  3. This same userspace application writes other data to a separate file on the SBC's filesystem.
  4. A Python script using the logging module to write to a log file.

I am seeing a bizarre problem where these four log files occasionally become "corrupted" in the following ways:

  • Anywhere from 78 to 309 NUL characters are inserted into the log file
  • Text from another log file (40 or so lines) gets inserted into it
  • When this happens, there is a chunk of data missing from the log file

Additionally, there are some instances where only the NUL characters are inserted and no text from other logs is intermingled.

In a log file on the order of 100k lines, this behavior occurs 7-9 times.

The NUL characters by themselves would be fairly easy to remove during post-processing. However, the unexpected data showing up from other files is breaking parsing scripts. And of course, the missing chunk of data is the most troublesome.
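
The NUL stripping itself is a one-liner; a sketch of that post-processing step, demonstrated on a synthetic corrupted file (paths are hypothetical):

```python
# Sketch: strip NUL bytes from a log file in post-processing.
# Binary mode avoids decode errors on the corrupted bytes.

# Create a synthetic "corrupted" log for the demonstration.
with open("/tmp/corrupted.log", "wb") as f:
    f.write(b"line one\n" + b"\x00" * 100 + b"line two\n")

with open("/tmp/corrupted.log", "rb") as f:
    data = f.read()

with open("/tmp/cleaned.log", "wb") as f:
    f.write(data.replace(b"\x00", b""))  # drop every NUL byte
```

The intermingled text and the missing data, of course, cannot be repaired this way.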

Does anyone know what might cause this bizarre behavior to occur?

  • I've seen this on systems (non-embedded) that have crashed. Unfortunately, log files are often in use, so when a system crashes they are among the files most likely to get corrupted. It can be ameliorated by changing your source code to call fsync() after each important write, or by using appropriate file modes to force writes to disk frequently. – Mark Plotnick Jan 07 '14 at 20:05
  • Working theory right now is some kind of file handle corruption. Thoughts..? – tony_tiger Jan 22 '14 at 01:28
  • Not sure. Are the boundaries of the erroneous chunks on round 512-byte boundaries? Aside from the instances where the corruption was caused by the system crashing or losing power during heavy I/O, I think we saw this once and concluded it was bad cache on a disk controller. Can you run fsck (or its equivalent) on the filesystem? Test the system's RAM for errors (or just swap it out)? – Mark Plotnick Jan 22 '14 at 01:52
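
Following up on the 512-byte-boundary question in the last comment, a small sketch that locates NUL runs in a file and reports their start/end offsets modulo 512 (the demo file is synthetic; on the SBC one would point this at the real logs):

```python
import re

def nul_runs(path):
    """Return (start, end, start % 512, end % 512) for each NUL run in a file."""
    with open(path, "rb") as f:
        data = f.read()
    return [(m.start(), m.end(), m.start() % 512, m.end() % 512)
            for m in re.finditer(b"\x00+", data)]

# Synthetic demonstration: 100 NULs starting exactly at offset 512.
with open("/tmp/nul_demo.log", "wb") as f:
    f.write(b"x" * 512 + b"\x00" * 100 + b"tail\n")

for start, end, s_mod, e_mod in nul_runs("/tmp/nul_demo.log"):
    print(f"NUL run {start}-{end}: start % 512 = {s_mod}, end % 512 = {e_mod}")
```

A start-offset remainder of 0 on the real logs would support the sector-boundary theory.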
