I'm running an Ubuntu 12.04 derivative (amd64) and I've been having really strange issues recently. Out of the blue, seemingly, X will freeze completely for a while (1-3 minutes?) and then the system will reboot. This system is overclocked, but very stable as verified in Windows, which leads me to believe I'm having a kernel panic or an issue with one of my modules. Even in Linux, I can run LINPACK and won't see a crash despite putting ridiculous load on the CPU. Crashes seem to happen at random times, even when the machine is sitting idle.
How can I debug what's crashing the system?
On a hunch that it might be the proprietary NVIDIA driver, I reverted all the way down to the stable version of the driver, version 304 and I still experience the crash.
Can anyone walk me through a good debugging procedure for after a crash? I'd be more than happy to boot into a thumb drive and post all of my post-crash configuration files, I'm just not sure what they would be. How can I find out what's crashing my system?
Here are a bunch of logs, the usual culprits.
.xsession-errors: http://pastebin.com/EEDtVkVm
/var/log/Xorg.0.log: http://pastebin.com/ftsG5VAn
/var/log/kern.log: http://pastebin.com/Hsy7jcHZ
/var/log/syslog: http://pastebin.com/9Fkp3FMz
I can't even seem to find a record of the crash at all.
Triggering the crash is not so simple, it seem to happen when the GPU is trying to draw multiple things at once. If I put on a YouTube video in full screen and let it repeat for a while or scroll through a ton of GIFs and a Skype notification pops up, sometimes it'll crash. Totally scratching my head on this one.
The CPU is overclocked to 4.8GHz, but it's completely stable and has survived huge LINPACK runs and 9 hours of Prime95 yesterday without a single crash.
Update
I've installed kdump
, crash
, and linux-crashdump
, as well as the kernel debug symbols for my kernel version 3.2.0-35. When I run apport-unpack
on the crashed kernel file and then crash
on the VmCore
crash dump, here's what I see:
KERNEL: /usr/lib/debug/boot/vmlinux-3.2.0-35-generic
DUMPFILE: Downloads/crash/VmCore
CPUS: 8
DATE: Thu Jan 10 16:05:55 2013
UPTIME: 00:26:04
LOAD AVERAGE: 2.20, 0.84, 0.49
TASKS: 614
NODENAME: mightymoose
RELEASE: 3.2.0-35-generic
VERSION: #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012
MACHINE: x86_64 (3499 Mhz)
MEMORY: 8 GB
PANIC: "[ 1561.519960] Kernel panic - not syncing: Fatal Machine check"
PID: 0
COMMAND: "swapper/5"
TASK: ffff880211251700 (1 of 8) [THREAD_INFO: ffff880211260000]
CPU: 5
STATE: TASK_RUNNING (PANIC)
When I run log
from the crash
utility, I see this at the bottom of the log:
[ 1561.519943] [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000800400
[ 1561.519946] [Hardware Error]: RIP !INEXACT! 33:<00007fe99ae93e54>
[ 1561.519948] [Hardware Error]: TSC 539b174dead ADDR 3fe98d264ebd MISC 1
[ 1561.519950] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 1 microcode 28
[ 1561.519951] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519953] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 3: be00000000800400
[ 1561.519955] [Hardware Error]: TSC 539b174de9d ADDR 3fe98d264ebd MISC 1
[ 1561.519957] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 0 microcode 28
[ 1561.519958] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519959] [Hardware Error]: Machine check: Processor context corrupt
[ 1561.519960] Kernel panic - not syncing: Fatal Machine check
[ 1561.519962] Pid: 0, comm: swapper/5 Tainted: P M C O 3.2.0-35-generic #55-Ubuntu
[ 1561.519963] Call Trace:
[ 1561.519964] <#MC> [<ffffffff81644340>] panic+0x91/0x1a4
[ 1561.519971] [<ffffffff8102abeb>] mce_panic.part.14+0x18b/0x1c0
[ 1561.519973] [<ffffffff8102ac80>] mce_panic+0x60/0xb0
[ 1561.519975] [<ffffffff8102aec4>] mce_reign+0x1f4/0x200
[ 1561.519977] [<ffffffff8102b175>] mce_end+0xf5/0x100
[ 1561.519979] [<ffffffff8102b92c>] do_machine_check+0x3fc/0x600
[ 1561.519982] [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519984] [<ffffffff8165d78c>] machine_check+0x1c/0x30
[ 1561.519986] [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519987] <<EOE>> [<ffffffff81509697>] ? menu_select+0xe7/0x2c0
[ 1561.519991] [<ffffffff815082d1>] cpuidle_idle_call+0xc1/0x280
[ 1561.519994] [<ffffffff8101322a>] cpu_idle+0xca/0x120
[ 1561.519996] [<ffffffff8163aa9a>] start_secondary+0xd9/0xdb
bt
outputs the backtrace:
PID: 0 TASK: ffff880211251700 CPU: 5 COMMAND: "swapper/5"
#0 [ffff88021ed4aba0] machine_kexec at ffffffff8103947a
#1 [ffff88021ed4ac10] crash_kexec at ffffffff810b52c8
#2 [ffff88021ed4ace0] panic at ffffffff81644347
#3 [ffff88021ed4ad60] mce_panic.part.14 at ffffffff8102abeb
#4 [ffff88021ed4adb0] mce_panic at ffffffff8102ac80
#5 [ffff88021ed4ade0] mce_reign at ffffffff8102aec4
#6 [ffff88021ed4ae40] mce_end at ffffffff8102b175
#7 [ffff88021ed4ae70] do_machine_check at ffffffff8102b92c
#8 [ffff88021ed4af50] machine_check at ffffffff8165d78c
[exception RIP: intel_idle+191]
RIP: ffffffff8136d48f RSP: ffff880211261e38 RFLAGS: 00000046
RAX: 0000000000000020 RBX: 0000000000000008 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880211261fd8 RDI: ffffffff81c12f00
RBP: ffff880211261e98 R8: 00000000fffffffc R9: 0000000000000f9f
R10: 0000000000001e95 R11: 0000000000000000 R12: 0000000000000003
R13: ffff88021ed5ac70 R14: 0000000000000020 R15: 12d818fb42cfe42b
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <MCE exception stack> ---
#9 [ffff880211261e38] intel_idle at ffffffff8136d48f
#10 [ffff880211261ea0] cpuidle_idle_call at ffffffff815082d1
#11 [ffff880211261f00] cpu_idle at ffffffff8101322a
Any ideas?
tail -f /var/log/kern.log
running and try to catch it that way. – ott-- Jan 07 '13 at 22:17/var/log/kern.log
, but now looking intosyslog
. – Naftuli Kay Jan 08 '13 at 04:09ssh myhost "tail -f /var/log/kern.log"
outputs nothing, even when it crashes. – Naftuli Kay Jan 10 '13 at 21:44tail -f
, not as one command. – ott-- Jan 10 '13 at 22:04linux-crashdump
my/var/crash
was empty after a kernel panic test – auraham Oct 25 '14 at 07:48mcelog --ascii
as suggested and copy-paste the previous log lines, removing the timestamps like[ 1561.519946]
... I get some more details listed. It says "MCA: Internal Timer error". (mcelog also says "Hardware event. This is not a software error."). – sourcejedi Aug 04 '18 at 20:18