28

I'm running an Ubuntu 12.04 derivative (amd64) and I've been having really strange issues recently. Out of the blue, seemingly, X will freeze completely for a while (1-3 minutes?) and then the system will reboot. This system is overclocked, but very stable as verified in Windows, which leads me to believe I'm having a kernel panic or an issue with one of my modules. Even in Linux, I can run LINPACK and won't see a crash despite putting ridiculous load on the CPU. Crashes seem to happen at random times, even when the machine is sitting idle.

How can I debug what's crashing the system?

On a hunch that it might be the proprietary NVIDIA driver, I reverted all the way down to the stable version of the driver, version 304, and I still experience the crash.

Can anyone walk me through a good debugging procedure for after a crash? I'd be more than happy to boot from a thumb drive and post all of my post-crash configuration files; I'm just not sure which files those would be. How can I find out what's crashing my system?

Here are a bunch of logs, the usual culprits.

.xsession-errors: http://pastebin.com/EEDtVkVm

/var/log/Xorg.0.log: http://pastebin.com/ftsG5VAn

/var/log/kern.log: http://pastebin.com/Hsy7jcHZ

/var/log/syslog: http://pastebin.com/9Fkp3FMz

I can't even seem to find a record of the crash at all.

Triggering the crash is not so simple; it seems to happen when the GPU is trying to draw multiple things at once. If I put on a YouTube video in full screen and let it repeat for a while, or scroll through a ton of GIFs, and a Skype notification pops up, sometimes it'll crash. Totally scratching my head on this one.

The CPU is overclocked to 4.8GHz, but it's completely stable and has survived huge LINPACK runs and 9 hours of Prime95 yesterday without a single crash.

Update

I've installed kdump, crash, and linux-crashdump, as well as the kernel debug symbols for my kernel version 3.2.0-35. When I run apport-unpack on the crashed kernel file and then crash on the VmCore crash dump, here's what I see:

      KERNEL: /usr/lib/debug/boot/vmlinux-3.2.0-35-generic
    DUMPFILE: Downloads/crash/VmCore
        CPUS: 8
        DATE: Thu Jan 10 16:05:55 2013
      UPTIME: 00:26:04
LOAD AVERAGE: 2.20, 0.84, 0.49
       TASKS: 614
    NODENAME: mightymoose
     RELEASE: 3.2.0-35-generic
     VERSION: #55-Ubuntu SMP Wed Dec 5 17:42:16 UTC 2012
     MACHINE: x86_64  (3499 Mhz)
      MEMORY: 8 GB
       PANIC: "[ 1561.519960] Kernel panic - not syncing: Fatal Machine check"
         PID: 0
     COMMAND: "swapper/5"
        TASK: ffff880211251700  (1 of 8)  [THREAD_INFO: ffff880211260000]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)

When I run log from the crash utility, I see this at the bottom of the log:

[ 1561.519943] [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 3: be00000000800400
[ 1561.519946] [Hardware Error]: RIP !INEXACT! 33:<00007fe99ae93e54> 
[ 1561.519948] [Hardware Error]: TSC 539b174dead ADDR 3fe98d264ebd MISC 1 
[ 1561.519950] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 1 microcode 28
[ 1561.519951] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519953] [Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 3: be00000000800400
[ 1561.519955] [Hardware Error]: TSC 539b174de9d ADDR 3fe98d264ebd MISC 1 
[ 1561.519957] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 0 microcode 28
[ 1561.519958] [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1561.519959] [Hardware Error]: Machine check: Processor context corrupt
[ 1561.519960] Kernel panic - not syncing: Fatal Machine check
[ 1561.519962] Pid: 0, comm: swapper/5 Tainted: P   M     C O 3.2.0-35-generic #55-Ubuntu
[ 1561.519963] Call Trace:
[ 1561.519964]  <#MC>  [<ffffffff81644340>] panic+0x91/0x1a4
[ 1561.519971]  [<ffffffff8102abeb>] mce_panic.part.14+0x18b/0x1c0
[ 1561.519973]  [<ffffffff8102ac80>] mce_panic+0x60/0xb0
[ 1561.519975]  [<ffffffff8102aec4>] mce_reign+0x1f4/0x200
[ 1561.519977]  [<ffffffff8102b175>] mce_end+0xf5/0x100
[ 1561.519979]  [<ffffffff8102b92c>] do_machine_check+0x3fc/0x600
[ 1561.519982]  [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519984]  [<ffffffff8165d78c>] machine_check+0x1c/0x30
[ 1561.519986]  [<ffffffff8136d48f>] ? intel_idle+0xbf/0x150
[ 1561.519987]  <<EOE>>  [<ffffffff81509697>] ? menu_select+0xe7/0x2c0
[ 1561.519991]  [<ffffffff815082d1>] cpuidle_idle_call+0xc1/0x280
[ 1561.519994]  [<ffffffff8101322a>] cpu_idle+0xca/0x120
[ 1561.519996]  [<ffffffff8163aa9a>] start_secondary+0xd9/0xdb

bt outputs the backtrace:

PID: 0      TASK: ffff880211251700  CPU: 5   COMMAND: "swapper/5"
 #0 [ffff88021ed4aba0] machine_kexec at ffffffff8103947a
 #1 [ffff88021ed4ac10] crash_kexec at ffffffff810b52c8
 #2 [ffff88021ed4ace0] panic at ffffffff81644347
 #3 [ffff88021ed4ad60] mce_panic.part.14 at ffffffff8102abeb
 #4 [ffff88021ed4adb0] mce_panic at ffffffff8102ac80
 #5 [ffff88021ed4ade0] mce_reign at ffffffff8102aec4
 #6 [ffff88021ed4ae40] mce_end at ffffffff8102b175
 #7 [ffff88021ed4ae70] do_machine_check at ffffffff8102b92c
 #8 [ffff88021ed4af50] machine_check at ffffffff8165d78c
    [exception RIP: intel_idle+191]
    RIP: ffffffff8136d48f  RSP: ffff880211261e38  RFLAGS: 00000046
    RAX: 0000000000000020  RBX: 0000000000000008  RCX: 0000000000000001
    RDX: 0000000000000000  RSI: ffff880211261fd8  RDI: ffffffff81c12f00
    RBP: ffff880211261e98   R8: 00000000fffffffc   R9: 0000000000000f9f
    R10: 0000000000001e95  R11: 0000000000000000  R12: 0000000000000003
    R13: ffff88021ed5ac70  R14: 0000000000000020  R15: 12d818fb42cfe42b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <MCE exception stack> ---
 #9 [ffff880211261e38] intel_idle at ffffffff8136d48f
#10 [ffff880211261ea0] cpuidle_idle_call at ffffffff815082d1
#11 [ffff880211261f00] cpu_idle at ffffffff8101322a

Any ideas?

Naftuli Kay
  • Are you using a binary blob graphics driver? – jordanm Jan 07 '13 at 21:58
  • Yes, NVIDIA. Is there somewhere I can get logs for that? – Naftuli Kay Jan 07 '13 at 22:14
  • Are there any panic messages in /var/log/kern.log or syslog after reboot? You can log in from another pc and have a tail -f /var/log/kern.log running and try to catch it that way. – ott-- Jan 07 '13 at 22:17
  • Nothing shows up in /var/log/kern.log, but now looking into syslog. – Naftuli Kay Jan 08 '13 at 04:09
  • I've downgraded my NVIDIA driver to 304 stable which is a pretty old driver and I'm still seeing the crash. Updated the OP with details. – Naftuli Kay Jan 09 '13 at 04:38
  • (1) Run your machine without overclocking, ensure the panic still occurs. (2) Use a network console to capture the panic. – derobert Jan 10 '13 at 21:17
  • I've tried capturing over the network, but no output gets logged. ssh myhost "tail -f /var/log/kern.log" outputs nothing, even when it crashes. – Naftuli Kay Jan 10 '13 at 21:44
  • Try to separate it, first ssh to the host, then do the tail -f, not as one command. – ott-- Jan 10 '13 at 22:04
  • Do you still get the crash if you clock it back to the normal CPU speed? – ProfessionalAmateur Jan 11 '13 at 22:44
  • Don't compare behavior under Windows and Linux; the drivers are not the same! (I suspect NVIDIA puts more effort into stabilising the Windows version.) Overclocking is my first suspect too. Please answer the question that keeps being asked: does this change without overclocking? – F. Hauri - Give Up GitHub Jan 12 '13 at 22:45
  • @TKKocheran That's not a network console. I guess I do need to write up my reference question on how to get output to post. – derobert Feb 01 '13 at 14:15
  • Do you have the option of downrevving your kernel to a 3.0.x kernel? Instead of 3.2? What kind of machine is it? and how much and what kind of memory do you have? – Tim Kennedy Jan 11 '13 at 19:59
  • 8GB RAM for now, the rest of my RAM is in the mail from a Corsair RMA. Total will be 32GB. I need to get to 3.3 at least for EFI stub loader support. – Naftuli Kay Jan 11 '13 at 20:03
  • I had some similar interrupt crashes: a kernel panic freeze, flashing Caps Lock and Scroll Lock lights, and an unresponsive computer (except for the reset button). This occurred on ONE of my machines, where both were running Ubuntu 14.04. I threw out the Belkin wireless USB adapter and replaced it with another similar one. That problem seems to be fixed; I did not figure on a USB device causing a total crash! –  Oct 11 '14 at 02:08
  • @NaftuliTzviKay, when you say "When I run apport-unpack on the crashed kernel file and then crash on the VmCore crash dump", what exactly were the commands? When I tested linux-crashdump, my /var/crash was empty after a kernel panic test. – auraham Oct 25 '14 at 07:48
  • If you run mcelog --ascii as suggested and copy-paste the previous log lines, removing the timestamps like [ 1561.519946]... I get some more details listed. It says "MCA: Internal Timer error". (mcelog also says "Hardware event. This is not a software error."). – sourcejedi Aug 04 '18 at 20:18

5 Answers

38

I have two suggestions to start.

The first you're not going to like. No matter how stable you think your overclocked system is, it would be my first suspect. And any developer you report the problem to will say the same thing. Your stable test workload isn't necessarily using the same instructions, stressing the memory subsystem as much, whatever. Stop overclocking. If you want people to believe the problem's not overclocking, then make it happen when not overclocking so you can get a clean bug report. This will make a huge difference in how much effort other people will invest in solving this problem. Having bug-free software is a point of pride, but reports from people with particularly questionable hardware setups are frustrating time-sinks that probably don't involve a real bug at all.

The second is to get the oops data, which, as you've noticed, doesn't go to any of the places you've mentioned.  If the crash happens only while running X11, I think local console is pretty much out (it's a pain anyway), so you need to do this over a serial console, over the network, or by saving to a local disk (which is trickier than it may sound, because you don't want an untrustworthy kernel to corrupt your filesystem).  Here are some ways to do this:

  • use netdump to save to a server over the network. I haven't done this in years, so I'm not sure this software is still around and working with modern kernels, but it's easy enough that it's worth a shot.
  • boot using a serial console (archived version, current version); you'll need a serial port free on both machines (whether an old-school one or a USB serial adapter) and a null modem cable; you'd configure the other machine to save the output.
  • kdump seems to be what the cool kids use nowadays, and seems quite flexible, although it wouldn't be my preference because it looks complex to set up. In short, it involves booting a different kernel that can do anything and inspect the former kernel's memory contents, but you have to essentially build the whole process and I don't see a lot of canned options out there.  Update: There are some nice distro things, actually; on Ubuntu, linux-crashdump (archived version, current version).
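For the kdump route on Ubuntu, here is a minimal sketch of checking that crash capture is actually armed once linux-crashdump is installed. The helper names has_crashkernel and kdump_status are made up for illustration; /proc/cmdline and /sys/kernel/kexec_crash_loaded are standard kernel interfaces. It assumes you rebooted after the install so that a crashkernel= reservation could take effect.

```shell
# Install the packaged kdump setup first (Ubuntu), then reboot:
#   sudo apt-get install linux-crashdump

# has_crashkernel CMDLINE
# Succeeds if the given kernel command line reserves memory for a capture kernel.
has_crashkernel() {
    case " $1 " in
        *" crashkernel="*) return 0 ;;
        *) return 1 ;;
    esac
}

# kdump_status (hypothetical helper)
# Reports whether the running kernel looks ready to capture a crash dump.
kdump_status() {
    if has_crashkernel "$(cat /proc/cmdline)"; then
        echo "crashkernel= reservation present"
    else
        echo "no crashkernel= on the kernel command line"
    fi
    # This file contains 1 when a capture kernel has been loaded via kexec.
    if [ -r /sys/kernel/kexec_crash_loaded ]; then
        echo "capture kernel loaded: $(cat /sys/kernel/kexec_crash_loaded)"
    fi
}
```

Running kdump_status before trying to reproduce the crash saves a wasted reboot cycle: if no capture kernel is loaded, the next panic will leave /var/crash empty.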

Once you get the debug info, there's a tool called ksymoops (archived version, current version (with ads)) that you can use to turn the addresses into symbol names and start getting an idea how your kernel crashed.  And if the symbolized dump doesn't mean anything to you, at least this is something helpful to report here or perhaps on your Linux distribution's mailing list / bug tracker.


From crash on your crashdump, you can try typing log and bt to get a bit more information (things logged during the panic and a stack back trace).  Your Fatal Machine check seems to be coming from here, though. From skimming the code, your processor has reported a Machine Check Exception – a hardware problem.  Again, my first bet would be due to overclocking. It seems like there might be a more specific message in the log output which could tell you more.
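A rough sketch of getting from the apport report to the crash prompt, assuming the Ubuntu layout shown in the question (debug kernel under /usr/lib/debug/boot, dump unpacked as VmCore). The helper names debug_vmlinux_path and open_vmcore are hypothetical, and the report file name is a placeholder.

```shell
# debug_vmlinux_path RELEASE
# Prints where the kernel debug-symbol package puts the uncompressed kernel
# image for RELEASE (the KERNEL path shown in the question's crash banner).
debug_vmlinux_path() {
    printf '/usr/lib/debug/boot/vmlinux-%s\n' "$1"
}

# open_vmcore REPORT WORKDIR RELEASE (hypothetical helper)
# Unpacks an apport kernel crash report into WORKDIR, then opens the VmCore
# found there with the crash utility.
open_vmcore() {
    apport-unpack "$1" "$2" || return 1
    crash "$(debug_vmlinux_path "$3")" "$2/VmCore"
}

# Example (use whatever report file is actually in /var/crash):
#   open_vmcore /var/crash/REPORT.crash ~/Downloads/crash 3.2.0-35-generic
# Then, at the crash> prompt:
#   log    # kernel messages logged up to and during the panic
#   bt     # stack backtrace of the panicking task
```

RELEASE has to be the release of the kernel that crashed, which is not necessarily the one you are running when you inspect the dump.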

Also from that code, it looks like if you boot with the mce=3 kernel parameter, it will stop crashing... but I wouldn't really recommend this except as a diagnostic step. If the Linux kernel thinks this error is worth crashing over, it's probably right.
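If you do want to try mce=3 as a diagnostic step, one way to set it on a GRUB 2 system such as Ubuntu is sketched below. add_kernel_param is a hypothetical helper, and it assumes the usual GRUB_CMDLINE_LINUX_DEFAULT="..." line in /etc/default/grub and GNU sed. Check the kernel-parameters documentation for your kernel version first; I'm not certain the numeric tolerance level is still accepted on newer kernels.

```shell
# add_kernel_param FILE PARAM (hypothetical helper)
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE,
# unless it is already there. Uses GNU sed's in-place editing (-i).
add_kernel_param() {
    file=$1
    param=$2
    # Already present on that line: nothing to do.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]$param[\" ]" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}

# Example (back up the file first, then regenerate the GRUB config and reboot):
#   sudo cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub mce=3    # run as root
#   sudo update-grub
```

Remove the parameter again once you are done testing; as said above, it hides the symptom rather than fixing the hardware fault.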

Scott Lamb
  • If the overclock is the problem, I'll be able to see a clock cycle get missed in crash logs, so at the end of the day, I'll know what the problem is. That's my goal: to figure out what's going wrong. If it's my overclock, then fine, I'd just like to know what the problem is. – Naftuli Kay Jan 10 '13 at 21:03
  • I don't think overclocking failures are as obvious as that to spot in the logs; I'm not a processor expert, but it's not like the whole processor correctly handles the clock cycle or indicates to the OS somehow that it missed it. Let me know if you have trouble getting logs, but IMHO by far the easiest way to know if it's an overclocking problem is to see if it happens when not overclocking. – Scott Lamb Jan 10 '13 at 21:28
  • Okay, I'll do that after backing up my settings. I might first just see if I can reproduce the crash in Windows. – Naftuli Kay Jan 10 '13 at 21:45
  • While I'm thankful to never ever encounter a BSOD in Linux, it would seem strange to me that while Windows would log and display a problem, Linux wouldn't be able to. – Naftuli Kay Jan 10 '13 at 21:51
  • One of those little quirks. :-/ There's no fundamental reason Ubuntu or RedHat couldn't set up a nice kdump-based system for crash logging and display out of the box, but no one's done it as far as I know. – Scott Lamb Jan 10 '13 at 21:58
  • Actually, I take that back. On Ubuntu, there is a linux-crashdump package you can install fairly easily to automatically put crashes in /var/crash. What distribution are you using? – Scott Lamb Jan 10 '13 at 22:21
  • I've updated the question, as I was able to crash the machine while running linux-crashdump and obtain a crash dump file which hopefully has enough information to determine the cause. – Naftuli Kay Jan 11 '13 at 00:37
  • Sweet. Updated my answer as well. – Scott Lamb Jan 11 '13 at 01:20
  • Thanks, I'll look into that. I've heard that this issue pops up sometimes on UEFI motherboards when booting into BIOS legacy mode, which is the case on this system. This could explain why I haven't seen the issue on Windows, as it boots EFI. I'm also running i7z as a daemon in the background and it's probably doing some devious stuff to get live processor frequencies, C-states, and other stuff. Needless to say, I've disabled that and I'll see if it crashes again. – Naftuli Kay Jan 11 '13 at 02:35
  • Got the log! Awesome help. I've updated the original post with the output of that log. I'm finally seeing the error now, any ideas on what might be causing it? – Naftuli Kay Jan 11 '13 at 02:50
  • All it means to me is that your processor isn't working. Probably the overclocking, maybe the other thing you mentioned (it's not something I've heard about), maybe a defective unit. – Scott Lamb Jan 11 '13 at 06:15
  • I would second the "Overclock is culprit" thought. MCE mostly occurs due to hardware issues. But, a segmentation fault in any module code can cause the same too.

    Two years back, my new i7 2600k was giving me the same MCE issue, even when I was not doing anything on the computer. When I dug a little deeper, I found that the BIOS version I was using with my Intel motherboard did not properly support the then-new processor. I updated the BIOS and the problem was gone. So I suggest you check that route too.

    – Soumyadip DM Jan 11 '13 at 21:57
  • Now that I know what's failing, is there a way for me to cause the crash with a given command? – Naftuli Kay Jan 11 '13 at 23:38
  • Not sure. I'm a software guy; this is the limit of my expertise. – Scott Lamb Jan 12 '13 at 07:14
  • @user643011: Thanks for catching the broken link and suggesting an edit, but when you do this (in the future), please check the entire post for problems, and fix them all. – G-Man Says 'Reinstate Monica' Feb 15 '24 at 14:40
5

a) Check if kernel messages are being logged to a file by rsyslog daemon

vi /etc/rsyslog.conf

And add the following

kern.*                 /var/log/kernel.log

Restart the rsyslog service.

/etc/init.d/rsyslog restart

b) Take a note of the loaded modules

lsmod > /your/home/dir/lsmod.txt

c) As the panic is not reproducible, wait for it to happen

d) Once the panic has occurred, boot the system using a live or emergency CD

e) Mount the filesystems of the affected system (usually / will suffice if /var and /home are not separate filesystems). If the affected system uses LVM, first run the pvs, vgs, and lvs commands to find the logical volume, and activate it with vgchange -ay if it is not already active.

mount -t ext4 /dev/sdXN /mnt

f) Go to /mnt/var/log/ directory and check the kernel.log file. This should give you enough information to figure out if the panic is happening for a particular module or something else.
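For step f, a small sketch of pulling the interesting lines out of the log rather than reading it top to bottom. scan_kernel_log is a hypothetical helper, and the pattern list is only a starting guess at what a panic or hardware fault tends to print, not an exhaustive one.

```shell
# scan_kernel_log FILE (hypothetical helper)
# Prints, with line numbers, the lines of FILE that look like a panic,
# an oops, or a hardware error report.
scan_kernel_log() {
    grep -nE 'Kernel panic|Machine Check|Hardware Error|Oops|BUG:' "$1"
}

# Example, from the live CD with the affected root filesystem on /mnt:
#   scan_kernel_log /mnt/var/log/kernel.log
```

If this prints nothing, the panic most likely never reached the disk, which is common because the kernel may be unable to write logs by the time it panics.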

  • Log results from that are pretty inconclusive: http://pastebin.com/VdYAHgiH – Naftuli Kay Jan 10 '13 at 22:08
  • In my experience, kernel crashes rarely get into kernel.log, as log information needs to go a pretty long way via syslog, filesystem driver, disk cache and disk driver. The simplest and most elegant way is to use the netconsole kernel module. – dma_k Jun 14 '15 at 12:00
2

Is your processor overclocked? I had this same issue today when I was playing with the multiplier in the over-clocking menu in my BIOS; various multipliers around 20x would cause this to happen. I reduced it down to 18.5x (3.7GHz) and the problem went away; I think it was a motherboard/power issue.

Michael Mrozek
  • Yes, it had everything to do with overclocking. Evidently, Windows seems to be a bit more fault-tolerant with certain processor faults, if the CPU can keep going. I might start booting with mce=3 to prevent crashing, but in the past, I've simply increased the voltage each time it's crashed (which hasn't been so often). Something to note is that I'm using an offset voltage, which is generally speaking more unstable. – Naftuli Kay May 13 '13 at 20:14
1

Most definitely a processor issue. Notice the lines that say: TSC 539b174dead ADDR 3fe98d264ebd MISC 1 [ 1561.519950] [Hardware Error]: PROCESSOR 0:206a7 TIME 1357862746 SOCKET 0 APIC 1 microcode 28. Processor 0 is what the kernel used to process the crash (this matters in multi-CPU systems) and socket 0 is the offending processor (though I assume you only have one). Either it is bad or, as you noted, the overclock is causing the faults. I know you said you ran it through Prime95, but since I don't know how old the system is, I am grasping at a few straws: how does your thermal paste look, and have you checked that the LGA socket (under the CPU) looks all right? I am thinking of bent pins or some paste in the socket. Again, just hunting for a root cause here.

If that fails to fix the issue, there is a little trick that uses your SMBIOS data to find where the panic hits. Another line (TSC 539b174de9d ADDR 3fe98d264ebd MISC 1) holds the address information that mcelog can match against SMBIOS (DMI) data. When your machine is up, run echo "TSC 539b174de9d ADDR 3fe98d264ebd MISC 1" | sudo mcelog --ascii --dmi on the command line. The output will tell you that it is a hardware error and even which DIMM was being accessed, which can point to a faulty DIMM or bus path. If the reported DIMM jumps around from crash to crash, however, that points to the CPU.
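To decode the whole report rather than one line, the log lines can be fed to mcelog in one go. A minimal sketch: strip_dmesg_timestamps is a hypothetical helper that removes the leading [ 1561.519946]-style timestamps, and mce.txt is a placeholder for wherever you saved the [Hardware Error] lines from crash's log output. It assumes mcelog is installed and reads the report on standard input, as in the echo example above.

```shell
# strip_dmesg_timestamps (hypothetical helper)
# Reads kernel log lines on stdin and removes the leading "[ seconds.micros]"
# timestamp from each one, leaving the rest of the line untouched.
strip_dmesg_timestamps() {
    sed -E 's/^\[ *[0-9]+\.[0-9]+\] *//'
}

# Example (mce.txt is a placeholder file holding the copied log lines):
#   strip_dmesg_timestamps < mce.txt | sudo mcelog --ascii
#   strip_dmesg_timestamps < mce.txt | sudo mcelog --ascii --dmi
```

Lines without a timestamp pass through unchanged, so it is safe to run this on a file that has already been partly cleaned up by hand.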

0

We had a MikroTik router installed on an old rig. The fan stopped spinning, causing the processor to heat up. The router then started to kernel panic every now and then. After we replaced the CPU fan, everything went well.

Since you are overclocking your machine, overheating could be a possible cause.