Summary: I'm trying to understand exactly what the following error message means:

[17016.923750] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[17016.923758] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[17016.923759] {4}[Hardware Error]: event severity: corrected
[17016.923761] {4}[Hardware Error]:  Error 0, type: corrected
[17016.923762] {4}[Hardware Error]:  fru_text: CorrectedErr
[17016.923764] {4}[Hardware Error]:   section_type: memory error

Details:

I have a server with an Intel(R) Xeon(R) E3-1275 v3 @ 3.50GHz CPU running Arch Linux (3.18.6-1-ARCH #1 SMP PREEMPT Sat Feb 7 08:44:05 CET 2015 x86_64 GNU/Linux).

When I run dmesg, I see the error that I posted above. The errors are not that frequent, but they keep happening. For instance, the server has been up for 1 day since the last reboot, and there are 9 instances of this error in the log.
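To make the counting reproducible, here is a minimal sketch of how the reports could be tallied from saved dmesg output. It assumes each report starts with the "Hardware error from APEI Generic Hardware Error Source" line shown above; the function name `find_apei_reports` and the `SAMPLE` text are only illustrative.

```python
import re

# Matches the first line of each report, e.g.
# "[17016.923750] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1"
HEADER = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s*\{(?P<seq>\d+)\}\[Hardware Error\]: "
    r"Hardware error from APEI Generic Hardware Error Source: (?P<source>\d+)"
)


def find_apei_reports(log_text):
    """Return one (timestamp_seconds, sequence_number, source_id) tuple per report."""
    reports = []
    for line in log_text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            reports.append(
                (float(match.group("ts")), int(match.group("seq")), int(match.group("source")))
            )
    return reports


# Illustrative input copied from the log excerpt in this question.
SAMPLE = """\
[17016.923750] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[17016.923758] {4}[Hardware Error]: It has been corrected by h/w and requires no further action
[17016.923759] {4}[Hardware Error]: event severity: corrected
"""

reports = find_apei_reports(SAMPLE)
print(len(reports), "report(s) found")
```

Only the header line of each report is counted, so the follow-up lines of the same report do not inflate the total.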

I saw another question that asked about this error and there was an answer that suggested the problem was that the ECC memory is failing.

My questions are:

1) Is there any reference to support the idea that this error message is associated with ECC memory?

2) If I do have a failing DIMM, is there a suggested way to figure out which one it is? I tried running memtest86+, but it did not report any memory errors.

3) If the OS reports that ECC errors have been corrected, does that really mean the DIMM is failing?
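Regarding question 2, one thing that could be checked is whether the kernel's EDAC subsystem keeps per-row corrected-error counters. The sketch below assumes the legacy csrow sysfs layout (`mc<N>/csrow<M>/ch<K>_ce_count` under `/sys/devices/system/edac/mc`) is present, which depends on the EDAC driver and kernel configuration; the function name `read_ce_counts` is hypothetical.

```python
from pathlib import Path


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {(mc, csrow, channel): corrected_error_count} read from EDAC sysfs.

    Assumes the legacy csrow layout: mc<N>/csrow<M>/ch<K>_ce_count.
    Returns an empty dict if the directory or the counters are missing.
    """
    counts = {}
    root = Path(edac_root)
    for path in sorted(root.glob("mc*/csrow*/ch*_ce_count")):
        mc = path.parent.parent.name              # e.g. "mc0"
        csrow = path.parent.name                  # e.g. "csrow2"
        channel = path.name.split("_", 1)[0]      # e.g. "ch1"
        try:
            counts[(mc, csrow, channel)] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or malformed counter: skip it rather than guess.
            continue
    return counts


for key, value in read_ce_counts().items():
    print(key, value)
```

A row or channel whose counter keeps growing would be the first candidate to swap or reseat; mapping a csrow and channel to a physical slot still depends on the motherboard.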

I wouldn't be so concerned if the only problem were a few messages in my log file, but I have also noticed that the server sometimes hangs unexpectedly. The machine is used for research, so stability is not as important as it would be on a production system. Still, having the machine hang can be problematic. So I would like to know exactly what this error message means and, if I need to replace a component, how to figure out which component it is.

Edit

Currently the server has been up for 8 days without hanging and I see 148 instances of this error message in the logs. In addition I see one instance of the following message:

[671211.188084] EDAC MC0: INTERNAL ERROR: csrow value is out of range (6 >= 4)
[671211.188333] EDAC MC0: 1 CE ie31200 CE on unknown memory (channel:1 page:0x0 offset:0x0 grain:0 syndrome:0xc8)

I guess it is likely that one of the DIMMs has a problem. Still, I would be interested if anyone has information on how to interpret these messages, in particular to figure out which DIMM might be failing.
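As a starting point for interpreting the EDAC line, here is a minimal sketch that splits it into named fields. It assumes the message format shown above (`EDAC MC<n>: <count> CE|UE <message> on <label> (<key:value ...>)`); other drivers or kernel versions may print different fields, and the function name `parse_edac_line` is illustrative.

```python
import re

# Assumed format, taken from the log line in this question:
# "EDAC MC0: 1 CE ie31200 CE on unknown memory (channel:1 page:0x0 ... syndrome:0xc8)"
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<msg>.+?) "
    r"on (?P<label>.+?) \((?P<details>[^)]*)\)"
)


def parse_edac_line(line):
    """Return a dict of fields from one EDAC CE/UE line, or None if it does not match."""
    match = EDAC_LINE.search(line)
    if match is None:
        return None
    fields = {
        "mc": int(match.group("mc")),
        "count": int(match.group("count")),
        "kind": match.group("kind"),      # "CE" = corrected, "UE" = uncorrected
        "msg": match.group("msg"),
        "label": match.group("label"),    # DIMM label, or "unknown memory"
    }
    for token in match.group("details").split():
        if ":" not in token:
            continue
        key, value = token.split(":", 1)
        try:
            fields[key] = int(value, 0)   # base 0 accepts both "1" and "0xc8"
        except ValueError:
            fields[key] = value           # keep non-numeric values as text
    return fields


line = ("[671211.188333] EDAC MC0: 1 CE ie31200 CE on unknown memory "
        "(channel:1 page:0x0 offset:0x0 grain:0 syndrome:0xc8)")
print(parse_edac_line(line))
```

For the line above, the only location hint that survives is the channel number; the label is "unknown memory", so this message alone does not name a DIMM.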

2 Answers

1

FYI, I appear to have had a similar issue. It was on a Xeon running Debian, recently upgraded from Wheezy to Jessie.

As it turned out, the solution was to take the memory out and reseat it; after that, everything was back to normal.

0

From what I have read, this error is normal and has to do with UEFI. Getting rid of the message needs a kernel change, but apparently it's harmless.
