Hardware error from APEI Generic Hardware Error Source (ECC RAM)

Question

[58306.633900] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[58306.633905] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[58306.633907] {1}[Hardware Error]: event severity: corrected
[58306.633909] {1}[Hardware Error]:  Error 0, type: corrected
[58306.633911] {1}[Hardware Error]:  fru_text: CorrectedErr
[58306.633912] {1}[Hardware Error]:   section_type: memory error
[58306.633914] {1}[Hardware Error]:   node: 0 device: 44696
[58306.633916] {1}[Hardware Error]:   error_type: 2, single-bit ECC

This appeared on my Debian Xeon server with ECC RAM. Does it mean the RAM modules are dying, or could it be something else, such as an error caused by software? I saw another post where the poster's OS rebooted, while mine didn't, which is why I am asking. Thank you.

score 1 · Accepted Answer · answered Mar 17 '22 at 17:11

ECC memory errors are always hardware errors, not software errors. That doesn’t mean they indicate failing hardware, though: they can be caused by random bit flips. (Google’s 2009 paper on the topic provides interesting insights; its citations might provide more recent analyses.)

Hardware bit flips can be triggered by software, e.g. in Rowhammer attacks.

Unless the ECC errors become frequent, or you start seeing uncorrectable ECC errors, I wouldn’t worry about it.
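
To keep an eye on how often these events show up, something like the following rough sketch could tally them by severity from saved kernel log text. The function name `count_event_severities` and the regular expression are my own, written against the log format shown in the question; severity strings other than `corrected` are an assumption about what the kernel would print for more serious events.

```python
import re
from collections import Counter

# Matches the "event severity" line that appears once per reported event, e.g.
# "[58306.633907] {1}[Hardware Error]: event severity: corrected"
# (pattern derived from the log excerpt in the question)
SEVERITY_RE = re.compile(r"\[Hardware Error\]:\s*event severity:\s*(\w+)")


def count_event_severities(log_text):
    """Return a Counter mapping each severity string to its number of events."""
    return Counter(SEVERITY_RE.findall(log_text))


# Lines copied from the question's log excerpt
sample = """\
[58306.633900] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[58306.633905] {1}[Hardware Error]: It has been corrected by h/w and requires no further action
[58306.633907] {1}[Hardware Error]: event severity: corrected
[58306.633909] {1}[Hardware Error]:  Error 0, type: corrected
"""

counts = count_event_severities(sample)
print(counts["corrected"], "corrected event(s)")
```

You could feed it the output of `dmesg` or `journalctl -k` saved to a file. A steadily growing corrected count, or any severity other than `corrected`, would be the signal to look at the DIMMs more closely.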
