How can I assess why my computer is shutting down?

Question

I have an old laptop (around 10 years old, maybe), on which I have a minimal install of Debian 10. I use it to download and store media files, which I play back from other machines on my home network. I generally keep its lid closed, and access it through ssh. I've had it doing this for around a year, and it generally runs smoothly, apart from a random crash once every month or so. Recently, though, it started crashing way more frequently: anywhere from once a week to within minutes or an hour of me booting it and getting everything up and running, or even during boot.

I've run memtest86+ and a SMART test, and both reported no problems. I also checked the CPU core temperature, and it didn't seem to be the problem either. Like I said, this is an old laptop, so it may be that something has just reached its end of life, but I'd like to make sure that's the case...

What else should I be looking at to assess the reason(s) for these random crashes/shutdowns? I'm interested in figuring out whether this is a hardware or software problem, and how I can solve it, or, alternatively, which parts of the computer are potentially still salvageable.

Also happy to dump whatever extra info is needed here :)

As per this comment, pasting the output of dmesg --level=alert,crit,err,warn:

[    0.225970] ACPI BIOS Warning (bug): Incorrect checksum in table [ATKG] - 0xB0, should be 0x4A (20180810/tbprint-177)
[    0.362067] core: PEBS disabled due to CPU errata
[    0.363544] mtrr: your CPUs had inconsistent variable MTRR settings
[    0.424461] Expanded resource Reserved due to conflict with PCI Bus 0000:00
[    3.474163] Unstable clock detected, switching default tracing clock to "global"
               If you want to keep using the local clock, then add:
                 "trace_clock=local"
               on the kernel command line
[    3.728460] ACPI Warning: SystemIO range 0x0000000000000828-0x000000000000082F conflicts with OpRegion 0x0000000000000800-0x000000000000084F (\PMIO) (20180810/utaddress-213)
[    3.728473] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x000000000000053F (\GPIO) (20180810/utaddress-213)
[    3.728481] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x000000000000053F (\GPIO) (20180810/utaddress-213)
[    3.728488] lpc_ich: Resource conflict(s) found affecting gpio_ich

Have you inspected dmesg for any clues? Have you run sensors to check whether your CPU is not overheating? — Artem S. Tashkinov, Aug 28 '20 at 11:18
I've checked sensors, yes, and that doesn't seem to be the problem. I'm checking dmesg now, but I'm a bit of a noob and am not sure what I should be looking for. Got any pointers, @ArtemS.Tashkinov? Or should I paste its output into the question body? — Marcy, Aug 28 '20 at 11:32
Run dmesg --level=alert,crit,err,warn to see only what's "bad" ;-) You may paste it into your question, yes. — Artem S. Tashkinov, Aug 28 '20 at 11:39
There's nothing unusual in your dmesg output. Perhaps you're looking at a HW failure but I've no idea how to diagnose it. Also, I presume you've rebooted/powered on just recently, so the errors are yet to appear. — Artem S. Tashkinov, Aug 28 '20 at 11:42
Do you have any ideas of where to start looking to assess if it's some program that's causing the crash instead, @ArtemS.Tashkinov? — Marcy, Aug 28 '20 at 11:44
Programs usually cannot crash the system. Either you have a HW failure or your kernel craps out but in the latter case you won't normally see it in your dmesg output because at the point of a crash the kernel is unable to log anything to the disk - at most errors could be seen on the device screen. — Artem S. Tashkinov, Aug 28 '20 at 11:47
I see. I was thinking maybe a program was causing it because I've only observed it crashing after I get my stuff to take care of downloads (basically what I was trying to set up in CentOS here a while back, but on Debian instead, and being manually started rather than at boot) running — wanted to know if there's a way to assess causation there, but your comment seems to suggest that's not possible. — Marcy, Aug 28 '20 at 11:50
If it's crashing regularly, disable console blanking and monitor power saving (setterm) and next time it happens make a photo of your screen. — Artem S. Tashkinov, Aug 28 '20 at 11:54
I should clarify (if it was not clear already, in which case apologies) that the computer completely shuts down. So there's no freeze, and the monitor just goes black, with no error message that I've seen. But I'll keep a closer eye on it — unless I misunderstood what you meant in your latest comment? — Marcy, Aug 28 '20 at 11:56
the computer completely shuts down - that surely indicates a HW failure. Even if the kernel or some app crash, your system will keep on running in a broken state. — Artem S. Tashkinov, Aug 28 '20 at 11:58
I see. I also was misremembering — the computer sometimes just goes down even while booting... — Marcy, Aug 28 '20 at 12:19
Depending on how hard you want to work at preserving that hardware, I'd swap memory. While memory testers can find memory errors, in my experience they aren't reliable ways to determine memory is failing. In other words, I've had failed memory that passed memory tests, but when memory was replaced, faults went away. Also, sadly, if the computer is crashing, unless you go to extreme lengths to log interactively to a different system, there is a very high likelihood that the helpful logs will be lost because they don't get preserved on disk due to the crash. — kbulgrien, Aug 28 '20 at 12:21
Thanks for the input, @kbulgrien. Would the same be true of a hard drive — ie., the SMART test showing all's well when in fact it maybe isn't? — Marcy, Aug 28 '20 at 12:25
I've never observed SMART to do that. Besides, a hard drive failure generally won't take a system down so hard/fast that you can't see what's happening, and anyway, there would surely be evidence of failing sectors in logs as I've never seen a hard drive fail so spectacularly all at once and intermittently on a regular basis. I'd guess there is pretty much zero chance of it being a hard drive issue. — kbulgrien, Aug 28 '20 at 12:29
Thanks for the extra info, @kbulgrien. I guess I need to assess whether replacing the memory is worth it, or if maybe I should just grab the HDD and use it elsewhere... — Marcy, Aug 28 '20 at 12:31
I take my systems apart and clean them. Unless this is a laptop, tearing it apart and putting it back together can help. Reseating all cards in slots is a good thing on old hardware. It probably can't hurt to disconnect and reconnect all cables. Make sure all cooling fans and air flow are unobstructed. Wiggle power connectors to make sure none are twitchy, and check that the CPU/GPU heat sinks are tight. I've had badly designed power supply cables cause that kind of problem, and I now stay away from at least one brand because of it. None of this is rocket science, but it can fix some things. — kbulgrien, Aug 28 '20 at 12:34
It is a laptop, but maybe opening it up is the best next step, given all the comments here — it has been around a year since I last opened it up to clean it up and replace the thermal paste, so maybe some maintenance cleaning is in order :| — Marcy, Aug 28 '20 at 12:35
Oh, a laptop... well, good luck with that... they are far more susceptible to issues... being dropped, breaking things when cables are involved in breaking falls, etc. Make very sure that the vents are unobstructed. I have a laptop that has to sit well above the surface it is sitting on, or it overheats and does very bad things. — kbulgrien, Aug 28 '20 at 12:39
In my experience RAM either works or doesn't - swapping it is very unlikely to help. — Artem S. Tashkinov, Aug 28 '20 at 12:43

kbulgrien · Answer 1 · 2020-08-28T13:11:17.830

There's a good chance that such faults are hardware related, though a driver issue is conceivable. It's hard to come up with a procedure to follow to diagnose this.

One should definitely comb the logs for clues, but, sadly, when the computer goes down as fast as described, logs are often not helpful because they aren't reliably written to or retained on disk. If you really want to chase this down, logging to a remote host is something to consider, so that any messages are captured on a system that isn't crashing.

As some comments indicate, use tools like:

dmesg
sensors
dmesg --level=alert,crit,err,warn
journalctl -xe / journalctl --full and/or examine files in /var/log
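
As a rough sketch of how to comb that output, the hypothetical helper below filters kernel log text for lines that sometimes accompany hardware-triggered shutdowns. The function name and the keyword list are my own assumptions, not an exhaustive catalogue, and an abrupt power-off may well leave nothing to find.

```shell
#!/bin/sh
# Hypothetical helper (name and keyword list are assumptions): read kernel
# log text on stdin and print only the lines mentioning thermal trips or
# machine-check / hardware errors. Matching is case-insensitive.
scan_hw_clues() {
    grep -i -E 'critical temperature|thermal|machine check|mce:|hardware error'
}
```

For example, `journalctl -k -b -1 | scan_hw_clues` checks the kernel messages of the previous boot. That only works if the journal is kept on disk; on Debian 10 that typically means creating /var/log/journal and restarting systemd-journald first.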

For remote logging, look into rsyslog (or some other agent with similar capabilities).
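
A minimal sketch of the sending side, assuming rsyslog is installed on the laptop and a second machine is already listening for syslog over TCP. The helper name, the file name 90-forward.conf, and the address 192.0.2.10 are placeholders of mine, not anything from your setup.

```shell
#!/bin/sh
# Hypothetical helper: write an rsyslog drop-in that forwards every message
# to a remote collector. "@@" selects TCP; a single "@" would select UDP.
# Arguments: collector host, port (default 514), config dir (default /etc/rsyslog.d).
write_forward_conf() {
    host="$1"
    port="${2:-514}"
    dir="${3:-/etc/rsyslog.d}"
    printf '*.* @@%s:%s\n' "$host" "$port" > "$dir/90-forward.conf"
}

# Example (as root, then restart rsyslog so it picks up the file):
#   write_forward_conf 192.0.2.10
#   systemctl restart rsyslog
# The receiving machine needs the TCP input enabled in its own rsyslog config:
#   module(load="imtcp")
#   input(type="imtcp" port="514")
```

Even then, messages produced in the last instant before power is lost may never leave the machine, so treat an empty remote log as one more hint toward hardware rather than as proof.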

Depending on how hard you want to work at preserving that hardware, try swapping out the RAM. While memory testers can find memory errors, in my experience they aren't a reliable way to determine that memory is failing. In other words, I've had failed memory that passed memory tests, but when the memory was replaced, the faults went away. (I've definitely seen intermittently failing memory in more than one system, though it is uncommon.)

A hard drive failure generally won't take a system down so hard/fast that you can't see what's happening, and anyway, there would surely be evidence of failing sectors in logs as I've never seen a hard drive fail so spectacularly all at once and yet intermittently on a regular basis. I'd guess there is pretty much zero chance of it being a hard drive issue.

I take my systems apart and clean them. Unless this is a laptop, tearing it apart and putting it back together can help. Reseating cards/RAM in slots is a good thing on old hardware. Disconnecting/reconnecting cables might help. Make sure all cooling fans and air flow are unobstructed. Wiggle power connectors to make sure none are twitchy, and check that the CPU/GPU heat sinks are tight. I've had badly designed power supply cables cause that kind of problem and randomly take the system down (I now stay away from at least one brand because of that). None of this is rocket science, but it can fix some things.

If it is a laptop... well, good luck with that... they are far more susceptible to issues... being dropped, breaking/cracking things when cables are involved in breaking falls, etc. Cracked circuit boards make for really great "random" problem sources. Make very sure that the vents are unobstructed. I have a laptop that has to sit well above the surface it is sitting on, or it overheats and does very bad things, but I wouldn't really expect heat to cause a crash during boot.
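
To rule heat in or out more firmly than a one-off sensors reading, here is a small sketch that records temperatures over time so the last values before a shutdown survive on disk. It assumes the kernel exposes thermal zones under /sys/class/thermal (values in millidegrees Celsius); the function name and the log file in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one timestamped line per thermal zone.
# The optional argument overrides the sysfs directory (default shown below).
log_temps() {
    dir="${1:-/sys/class/thermal}"
    for zone in "$dir"/thermal_zone*; do
        # Skip zones without a readable temp file (or an unmatched glob).
        [ -r "$zone/temp" ] || continue
        milli=$(cat "$zone/temp")
        # Convert millidegrees to whole degrees Celsius.
        printf '%s %s %s C\n' "$(date '+%F %T')" "${zone##*/}" "$((milli / 1000))"
    done
}

# Example: sample every 10 seconds and flush to disk after each sample.
#   while :; do log_temps >> "$HOME/temps.log"; sync; sleep 10; done
```

If the last lines in the log before a shutdown show ordinary temperatures, overheating becomes a less likely culprit.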
