
A while ago, I repurposed my old desktop computer into a Debian server, and it ran flawlessly for half a year.

But then I decided to move the machine somewhere with a better Internet connection, and to add a bunch of HDDs to make it a proper storage server (a homemade NAS, so to speak).

Since then, the server has been crashing randomly. Sometimes it takes more than a month to crash, sometimes a day. Lately, it crashes about every 2-3 days.

Looking at dmesg, the cause of the crashes seems different every single time. I have no idea what is actually causing them.

Setup

  • CPU: Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
  • MotherBoard: MSI MS-7821/Z87-G45 GAMING
  • The machine runs Debian Stretch on Linux 4.9.0-8-amd64
  • Kdump is installed
  • The system is installed on a Samsung SSD 840 PRO (128 GB)
  • Five 8 TB Western Digital Red HDDs for the storage
  • The HDDs were initially in a software RAID5 configuration using mdadm, but are now managed by ZFS using raidz2
  • Apache2 (with Nextcloud) and transmission-daemon are running

dmesg

dmesg.201904140557
[230866.137537] PANIC: double fault, error_code: 0x0
[230866.137548] PANIC: double fault, error_code: 0x0
[230866.137550] CPU: 2 PID: 25608 Comm: apache2 Tainted: P          IO    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[230866.137551] Hardware name: MSI MS-7821/Z87-G45 GAMING (MS-7821), BIOS V1.1 05/03/2013
[230866.137551] task: ffff8d7d1eabe0c0 task.stack: ffffa02483d5c000
[230866.137555] RIP: 0010:[<ffffffffad8192fa>]  [<ffffffffad8192fa>] syscall_return_via_sysret+0x3e/0x4d
[230866.137556] RSP: 0018:ffffa02483d5ff50  EFLAGS: 00010002
[230866.137556] RAX: 0000000510035080 RBX: 0000000000000000 RCX: 00007fec9d79eacf
[230866.137557] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[230866.137557] RBP: 0000000000000000 R08: 00007fec6461ee20 R09: 0000000000000000
[230866.137558] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[230866.137558] R13: 0000000000000000 R14: 00007fec6461ee20 R15: 0000000000000000
[230866.137559] FS:  00007fec6461f700(0000) GS:ffff8d7e9fb00000(0000) knlGS:0000000000000000
[230866.137560] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[230866.137560] CR2: ffffa02483d5ff48 CR3: 0000000510034000 CR4: 0000000000160670
[230866.137561] Stack:
[230866.137563]  0000000000000000 0000000000000000 00007fec6461ee20 0000000000000000
[230866.137564]  0000000000000000 0000000000000000 0000000000000000 0000000000000293
[230866.137565]  0000000000000000 0000000000000000 00007fec6461ee20 0000000000000000
[230866.137565] Call Trace:
[230866.137580] Code: 50 48 8b 54 24 60 48 8b 74 24 68 48 8b 7c 24 70 50 90 0f 20 d8 65 48 0b 04 25 e0 02 01 00 78 08 65 88 04 25 e7 02 01 00 0f 22 d8 <58> 48 8b a4 24 98 00 00 00 0f 01 f8 48 0f 07 50 90 0f 20 d8 65 
[230866.137580] Kernel panic - not syncing: Machine halted.
[230866.137581] CPU: 2 PID: 25608 Comm: apache2 Tainted: P          IO    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[230866.137582] Hardware name: MSI MS-7821/Z87-G45 GAMING (MS-7821), BIOS V1.1 05/03/2013
[230866.137583]  0000000000000000 ffffffffad534524 ffff8d7e9fb07f00 ffff8d7e9fb07f18
[230866.137584]  ffffffffad380ecd ffffffff00000008 ffff8d7e9fb07f28 ffff8d7e9fb07ec0
[230866.137585]  88dd6d6a799c212f 00000000000000c8 0000000000000092 0000000000000000
[230866.137585] Call Trace:
[230866.137589]  <#DF> 
[230866.137589]  [<ffffffffad534524>] ? dump_stack+0x5c/0x78
[230866.137591]  [<ffffffffad380ecd>] ? panic+0xe4/0x23f
[230866.137592]  [<ffffffffad258ac9>] ? df_debug+0x29/0x30
[230866.137594]  [<ffffffffad227b0f>] ? do_double_fault+0x9f/0x130
[230866.137595]  [<ffffffffad81a038>] ? double_fault+0x28/0x30
[230866.137596]  [<ffffffffad8192fa>] ? syscall_return_via_sysret+0x3e/0x4d

dmesg.201904172335
[322137.449206] general protection fault: 0000 [#1] SMP
[322137.464088] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc xt_multiport iptable_filter wireguard(O) ip6_udp_tunnel udp_tunnel overlay nls_ascii nls_cp437 vfat fat snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic zfs(PO) intel_rapl zunicode(PO) x86_pkg_temp_thermal zavl(PO) intel_powerclamp zcommon(PO) znvpair(PO) snd_hda_intel kvm_intel spl(O) kvm i915 snd_hda_codec irqbypass snd_hda_core snd_hwdep snd_pcm crct10dif_pclmul crc32_pclmul iTCO_wdt ghash_clmulni_intel drm_kms_helper intel_cstate mei_me iTCO_vendor_support snd_timer drm intel_uncore snd
[322137.678356]  soundcore evdev i2c_algo_bit mxm_wmi mei efi_pstore intel_rapl_perf lpc_ich sg shpchp serio_raw mfd_core pcspkr efivars wmi intel_smartconnect video button nfsd auth_rpcgss oid_registry nfs_acl lockd grace nct6775 hwmon_vid coretemp sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic usbhid hid dm_mod sd_mod xhci_pci ahci ehci_pci xhci_hcd ehci_hcd crc32c_intel libahci libata aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper psmouse cryptd scsi_mod i2c_i801 i2c_smbus alx usbcore mdio thermal usb_common fan
[322137.867812] CPU: 2 PID: 2034 Comm: transmission-da Tainted: P          IO    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[322137.898560] Hardware name: MSI MS-7821/Z87-G45 GAMING (MS-7821), BIOS V1.1 05/03/2013
[322137.922267] task: ffff9d0366de8040 task.stack: ffffb6ca48838000
[322137.940254] RIP: 0010:[<ffffffffc0dc49e2>]  [<ffffffffc0dc49e2>] zio_create+0x52/0x470 [zfs]
[322137.965860] RSP: 0018:ffffb6ca4883b970  EFLAGS: 00010282
[322137.982034] RAX: fbff9cff4e756040 RBX: fbff9cff4e756040 RCX: fbff9cff4e756040
[322138.003667] RDX: 0000000000000000 RSI: 0000000002404200 RDI: fbff9cff4e756048
[322138.025297] RBP: ffff9d03710ec680 R08: 000039c6a0245fd0 R09: 0000000000000002
[322138.046929] R10: 0000000000000000 R11: 0000000000000000 R12: ffffb6ca4883bb30
[322138.068560] R13: 0000000000000001 R14: 00000000000f99d1 R15: ffff9cff040b1a10
[322138.090191] FS:  00007fee5e413700(0000) GS:ffff9d039fb00000(0000) knlGS:0000000000000000
[322138.114681] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[322138.132151] CR2: 000056466d3a1060 CR3: 00000005e6e22000 CR4: 0000000000160670
[322138.153783] Stack:
[322138.160066]  0000000000004000 ffff9cfebc544000 ffff9d0373c44000 ffff9d03710ec680
[322138.182681]  ffffffffc0d1eae0 ffff9cff040b1a10 ffff9cfebc544000 0000000000004000
[322138.205299]  ffff9d0373c44000 ffffffffc0dc551c ffffffffc0d1eae0 ffff9d027d98eaa8
[322138.227918] Call Trace:
[322138.235528]  [<ffffffffc0d1eae0>] ? arc_hdr_destroy+0x1e0/0x1e0 [zfs]
[322138.255086]  [<ffffffffc0dc551c>] ? zio_read+0xcc/0xe0 [zfs]
[322138.272293]  [<ffffffffc0d1eae0>] ? arc_hdr_destroy+0x1e0/0x1e0 [zfs]
[322138.291847]  [<ffffffffc0d21eb0>] ? arc_read+0x520/0xa30 [zfs]
[322138.309576]  [<ffffffffc0d28b8e>] ? dbuf_read+0x29e/0x7d0 [zfs]
[322138.327569]  [<ffffffffc0d294f8>] ? __dbuf_hold_impl+0x438/0x4d0 [zfs]
[322138.347379]  [<ffffffffc0d295fb>] ? dbuf_hold_impl+0x6b/0x90 [zfs]
[322138.366147]  [<ffffffffc0d298fb>] ? dbuf_hold+0x2b/0x60 [zfs]
[322138.383622]  [<ffffffffc0d30799>] ? dmu_buf_hold_array_by_dnode+0xf9/0x460 [zfs]
[322138.406034]  [<ffffffffc0d313d0>] ? dmu_read_uio_dnode+0x50/0xf0 [zfs]
[322138.426487]  [<ffffffffc0d323cd>] ? dmu_read_uio_dbuf+0x3d/0x60 [zfs]
[322138.446691]  [<ffffffffc0db0b97>] ? zfs_read+0x127/0x3b0 [zfs]
[322138.465045]  [<ffffffffc0dcae24>] ? zpl_read_common_iovec+0x84/0xd0 [zfs]
[322138.486274]  [<ffffffffc0dcb8e1>] ? zpl_iter_read+0xa1/0xe0 [zfs]
[322138.505406]  [<ffffffff8ae0aacd>] ? new_sync_read+0xdd/0x130
[322138.523175]  [<ffffffff8ae0b261>] ? vfs_read+0x91/0x130
[322138.539686]  [<ffffffff8ae0c8f0>] ? SyS_pread64+0x90/0xb0
[322138.556649]  [<ffffffff8ac03b7d>] ? do_syscall_64+0x8d/0xf0
[322138.574196]  [<ffffffff8b21924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[322138.595828] Code: 10 31 f6 4c 89 44 24 08 4c 89 0c 24 4c 8b a4 24 88 00 00 00 44 8b ac 24 90 00 00 00 e8 68 02 f4 ff 48 8d 78 08 48 89 c1 48 89 c3 <48> c7 00 00 00 00 00 48 c7 80 30 04 00 00 00 00 00 00 31 c0 48 
[322138.656162] RIP  [<ffffffffc0dc49e2>] zio_create+0x52/0x470 [zfs]
[322138.675286]  RSP <ffffb6ca4883b970>

dmesg.201904260559
[72133.666580] general protection fault: 0000 [#1] SMP
[72133.681200] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter overlay wireguard(O) ip6_udp_tunnel udp_tunnel nls_ascii nls_cp437 vfat fat snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp zfs(PO) zunicode(PO) kvm_intel snd_hda_codec_realtek kvm zavl(PO) snd_hda_codec_generic irqbypass crct10dif_pclmul zcommon(PO) crc32_pclmul snd_hda_intel znvpair(PO) i915 snd_hda_codec spl(O) ghash_clmulni_intel intel_cstate snd_hda_core snd_hwdep snd_pcm intel_uncore iTCO_wdt efi_pstore iTCO_vendor_support drm_kms_helper snd_timer drm
[72133.895207]  mxm_wmi intel_rapl_perf mei_me sg snd serio_raw mei i2c_algo_bit lpc_ich pcspkr soundcore mfd_core evdev efivars shpchp wmi video intel_smartconnect button nct6775 hwmon_vid coretemp nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic dm_mod usbhid hid sd_mod ahci libahci ehci_pci xhci_pci xhci_hcd ehci_hcd crc32c_intel libata aesni_intel psmouse aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd i2c_i801 scsi_mod i2c_smbus alx mdio usbcore usb_common fan thermal
[72134.084709] CPU: 3 PID: 4246 Comm: java Tainted: P          IO    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[72134.112335] Hardware name: MSI MS-7821/Z87-G45 GAMING (MS-7821), BIOS V1.1 05/03/2013
[72134.135784] task: ffff8dbb009d7100 task.stack: ffffb42103b38000
[72134.153510] RIP: 0010:[<ffffffffa9eea7a8>]  [<ffffffffa9eea7a8>] hrtimer_active+0x28/0x50
[72134.178049] RSP: 0018:ffffb42103b3be28  EFLAGS: 00010046
[72134.193962] RAX: 0000000000000000 RBX: ffff8dbb00c3c600 RCX: 0000000000000023
[72134.215337] RDX: fffd8dbb1fb94c00 RSI: 0000000000000008 RDI: ffff8dbb00c3c600
[72134.236710] RBP: 0000000000000000 R08: ffffffffaaa3eee0 R09: ffff8dbac7341380
[72134.258082] R10: 0000000000000013 R11: ffff8dbb01041b38 R12: ffff8dbb00c3c600
[72134.279452] R13: ffffb42103b3bec0 R14: 0000000000000000 R15: 0000000000000000
[72134.300824] FS:  00007fd2336ce700(0000) GS:ffff8dbb1fb80000(0000) knlGS:0000000000000000
[72134.325054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[72134.342261] CR2: 00007f36d94688a0 CR3: 00000005f211e000 CR4: 0000000000160670
[72134.363633] Stack:
[72134.369656]  ffffffffa9eeac77 0000000000000000 8a7c0674a85ffec5 ffff8dbb00c3c688
[72134.392008]  ffffb42103b3beb0 ffff8dbb00c3c600 ffffffffaa057b59 00007fd24811c410
[72134.414343]  ffffb42103b3bee0 ffff8dbb01041b00 0000000000000001 8a7c0674a85ffec5
[72134.436702] Call Trace:
[72134.444039]  [<ffffffffa9eeac77>] ? hrtimer_try_to_cancel+0x27/0x110
[72134.463080]  [<ffffffffaa057b59>] ? do_timerfd_settime+0x119/0x430
[72134.481590]  [<ffffffffaa058127>] ? SyS_timerfd_settime+0x57/0xb0
[72134.499837]  [<ffffffffa9e03b7d>] ? do_syscall_64+0x8d/0xf0
[72134.516529]  [<ffffffffaa41924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[72134.537380] Code: 00 00 00 0f 1f 44 00 00 48 8b 57 30 eb 1d 80 7f 38 00 75 32 48 3b 78 08 74 2c 39 50 04 75 e9 48 8b 57 30 48 8b 0a 48 39 c8 74 21 <48> 8b 02 8b 50 04 f6 c2 01 74 d8 f3 90 8b 50 04 f6 c2 01 75 f6 
[72134.596590] RIP  [<ffffffffa9eea7a8>] hrtimer_active+0x28/0x50
[72134.614098]  RSP <ffffb42103b3be28>

dmesg.201904270957
[100366.341655] general protection fault: 0000 [#1] SMP
[100366.356517] Modules linked in: veth xt_nat xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter overlay wireguard(O) ip6_udp_tunnel udp_tunnel nls_ascii nls_cp437 vfat fat snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel zfs(PO) zunicode(PO) kvm zavl(PO) irqbypass zcommon(PO) crct10dif_pclmul znvpair(PO) crc32_pclmul spl(O) ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic i915 intel_cstate iTCO_wdt iTCO_vendor_support snd_hda_intel intel_uncore mxm_wmi evdev serio_raw efi_pstore intel_rapl_perf snd_hda_codec pcspkr snd_hda_core
[100366.570669]  snd_hwdep drm_kms_helper mei_me sg snd_pcm lpc_ich snd_timer drm snd mfd_core mei i2c_algo_bit soundcore shpchp intel_smartconnect wmi efivars video button nct6775 hwmon_vid coretemp nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto ecb mbcache raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod hid_generic dm_mod usbhid hid sd_mod ahci libahci libata xhci_pci crc32c_intel aesni_intel ehci_pci psmouse aes_x86_64 glue_helper i2c_i801 lrw xhci_hcd ehci_hcd gf128mul i2c_smbus ablk_helper cryptd usbcore alx scsi_mod mdio usb_common fan thermal
[100366.760030] CPU: 3 PID: 28567 Comm: apache2 Tainted: P          IO    4.9.0-8-amd64 #1 Debian 4.9.144-3.1
[100366.788960] Hardware name: MSI MS-7821/Z87-G45 GAMING (MS-7821), BIOS V1.1 05/03/2013
[100366.812667] task: ffff8c41b1eb4100 task.stack: ffffac678f30c000
[100366.830659] RIP: 0010:[<ffffffff8549800a>]  [<ffffffff8549800a>] __task_pid_nr_ns+0x3a/0x90
[100366.855979] RSP: 0018:ffffac678f30fcc8  EFLAGS: 00010282
[100366.872152] RAX: 0000000000000508 RBX: ffff8c4292b7ba40 RCX: 0000000000000001
[100366.893787] RDX: ffffffff86045d20 RSI: 0000000000000004 RDI: f7ff8c428aaa95c8
[100366.915418] RBP: ffffac678f30ff30 R08: 0000000000000000 R09: 0000000000000000
[100366.937052] R10: 0000000000000000 R11: 0000000000000000 R12: ffffac678f30fd78
[100366.958683] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
[100366.980317] FS:  00007f29e0c20700(0000) GS:ffff8c445fb80000(0000) knlGS:0000000000000000
[100367.004809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[100367.022279] CR2: 00007f773f92a1f8 CR3: 00000002575ee000 CR4: 0000000000160670
[100367.043913] Stack:
[100367.050195]  ffffffff8569cb93 00007f29e0c1fe20 0000000000000000 0000000000000000
[100367.072811]  0000000000000000 ffffffff8608b548 ffff8c400bc4ef80 ffff8c4292b7bb08
[100367.095407]  ffffac678f30fd20 00000000000b0008 0000000000000000 ffffac678f30fd20
[100367.118027] Call Trace:
[100367.125627]  [<ffffffff8569cb93>] ? SYSC_semtimedop+0x3b3/0xc50
[100367.143623]  [<ffffffff8552bd04>] ? __seccomp_filter+0x74/0x270
[100367.161615]  [<ffffffff8542f1f0>] ? recalibrate_cpu_khz+0x10/0x10
[100367.180130]  [<ffffffff854f01dc>] ? ktime_get_ts64+0x4c/0xf0
[100367.197342]  [<ffffffff85620bbf>] ? poll_select_copy_remaining+0xdf/0x150
[100367.217934]  [<ffffffff85403337>] ? syscall_trace_enter+0x117/0x2c0
[100367.236964]  [<ffffffff85403b7d>] ? do_syscall_64+0x8d/0xf0
[100367.253918]  [<ffffffff85a1924e>] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[100367.275029] Code: 00 00 00 74 4e 85 f6 b8 08 05 00 00 74 1a 83 fe 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 08 05 00 00 48 8b bf d0 04 00 00 48 01 c7 <48> 8b 0f 48 85 c9 74 20 8b b2 30 08 00 00 31 c0 3b 71 04 77 0d 
[100367.334428] RIP  [<ffffffff8549800a>] __task_pid_nr_ns+0x3a/0x90
[100367.352738]  RSP <ffffac678f30fcc8>

Command output

# uname -a
Linux example.com 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64 GNU/Linux
# lsmod
Module                  Size  Used by
ipt_REJECT             16384  6
nf_reject_ipv4         16384  1 ipt_REJECT
veth                   16384  0
xt_nat                 16384  1
xt_tcpudp              16384  3
ipt_MASQUERADE         16384  2
nf_nat_masquerade_ipv4    16384  1 ipt_MASQUERADE
nf_conntrack_netlink    36864  0
nfnetlink              16384  2 nf_conntrack_netlink
xfrm_user              36864  1
xfrm_algo              16384  1 xfrm_user
iptable_nat            16384  1
nf_conntrack_ipv4      16384  2
nf_defrag_ipv4         16384  1 nf_conntrack_ipv4
nf_nat_ipv4            16384  1 iptable_nat
xt_addrtype            16384  2
xt_conntrack           16384  1
nf_nat                 24576  3 xt_nat,nf_nat_masquerade_ipv4,nf_nat_ipv4
nf_conntrack          114688  6 nf_conntrack_ipv4,nf_conntrack_netlink,nf_nat_masquerade_ipv4,xt_conntrack,nf_nat_ipv4,nf_nat
br_netfilter           24576  0
bridge                135168  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
xt_multiport           16384  1
iptable_filter         16384  1
wireguard             217088  0
ip6_udp_tunnel         16384  1 wireguard
udp_tunnel             16384  1 wireguard
overlay                49152  1
nls_ascii              16384  1
nls_cp437              20480  1
vfat                   20480  1
fat                    69632  1 vfat
snd_hda_codec_hdmi     49152  1
intel_rapl             20480  0
x86_pkg_temp_thermal    16384  0
intel_powerclamp       16384  0
kvm_intel             200704  0
kvm                   598016  1 kvm_intel
zfs                  2707456  8
irqbypass              16384  1 kvm
crct10dif_pclmul       16384  0
zunicode              331776  1 zfs
crc32_pclmul           16384  0
zavl                   16384  1 zfs
ghash_clmulni_intel    16384  0
zcommon                53248  1 zfs
intel_cstate           16384  0
znvpair                90112  2 zcommon,zfs
snd_hda_codec_realtek    90112  1
snd_hda_codec_generic    69632  1 snd_hda_codec_realtek
snd_hda_intel          36864  0
i915                 1257472  2
snd_hda_codec         135168  4 snd_hda_intel,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_codec_realtek
drm_kms_helper        155648  1 i915
intel_uncore          118784  0
spl                    98304  3 znvpair,zcommon,zfs
snd_hda_core           90112  5 snd_hda_intel,snd_hda_codec,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_codec_realtek
iTCO_wdt               16384  0
mei_me                 36864  0
efi_pstore             16384  0
snd_hwdep              16384  1 snd_hda_codec
mxm_wmi                16384  0
iTCO_vendor_support    16384  1 iTCO_wdt
evdev                  24576  2
drm                   360448  3 i915,drm_kms_helper
snd_pcm               110592  4 snd_hda_intel,snd_hda_codec,snd_hda_core,snd_hda_codec_hdmi
snd_timer              32768  1 snd_pcm
intel_rapl_perf        16384  0
efivars                20480  1 efi_pstore
serio_raw              16384  0
lpc_ich                24576  0
sg                     32768  0
snd                    86016  8 snd_hda_intel,snd_hwdep,snd_hda_codec,snd_timer,snd_hda_codec_hdmi,snd_hda_codec_generic,snd_hda_codec_realtek,snd_pcm
pcspkr                 16384  0
mei                   102400  1 mei_me
i2c_algo_bit           16384  1 i915
soundcore              16384  1 snd
mfd_core               16384  1 lpc_ich
shpchp                 36864  0
wmi                    16384  1 mxm_wmi
intel_smartconnect     16384  0
video                  40960  1 i915
button                 16384  1 i915
nfsd                  331776  13
auth_rpcgss            61440  1 nfsd
oid_registry           16384  1 auth_rpcgss
nfs_acl                16384  1 nfsd
lockd                  90112  1 nfsd
grace                  16384  2 nfsd,lockd
sunrpc                344064  18 auth_rpcgss,nfsd,nfs_acl,lockd
nct6775                57344  0
hwmon_vid              16384  1 nct6775
coretemp               16384  0
efivarfs               16384  1
ip_tables              24576  2 iptable_filter,iptable_nat
x_tables               36864  9 xt_multiport,ipt_REJECT,xt_nat,ip_tables,iptable_filter,xt_tcpudp,ipt_MASQUERADE,xt_addrtype,xt_conntrack
autofs4                40960  3
ext4                  585728  2
crc16                  16384  1 ext4
jbd2                  106496  1 ext4
fscrypto               28672  1 ext4
ecb                    16384  0
mbcache                16384  3 ext4
raid10                 49152  0
raid456               106496  0
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_xor,async_pq,raid456,async_memcpy,async_raid6_recov
xor                    24576  1 async_xor
raid6_pq              110592  3 async_pq,raid456,async_raid6_recov
libcrc32c              16384  1 raid456
crc32c_generic         16384  0
raid1                  36864  0
raid0                  20480  0
multipath              16384  0
linear                 16384  0
md_mod                135168  6 raid1,raid10,multipath,linear,raid0,raid456
hid_generic            16384  0
usbhid                 53248  0
hid                   122880  2 hid_generic,usbhid
dm_mod                118784  6
sd_mod                 49152  14
ehci_pci               16384  0
xhci_pci               16384  0
xhci_hcd              188416  1 xhci_pci
ahci                   40960  8
ehci_hcd               81920  1 ehci_pci
crc32c_intel           24576  5
libahci                32768  1 ahci
aesni_intel           167936  1
aes_x86_64             20480  1 aesni_intel
libata                249856  2 ahci,libahci
glue_helper            16384  1 aesni_intel
lrw                    16384  1 aesni_intel
usbcore               253952  6 usbhid,ehci_hcd,xhci_pci,xhci_hcd,ehci_pci
gf128mul               16384  1 lrw
ablk_helper            16384  1 aesni_intel
i2c_i801               24576  0
cryptd                 24576  3 ablk_helper,ghash_clmulni_intel,aesni_intel
psmouse               135168  0
i2c_smbus              16384  1 i2c_i801
alx                    45056  0
scsi_mod              225280  3 sd_mod,libata,sg
mdio                   16384  1 alx
usb_common             16384  1 usbcore
fan                    16384  0
thermal                20480  0

UPDATE

I ran memtest86 (the original from memtest86.com) both before and after re-seating the RAM modules: memtest.log

No error was found.

UPDATE

Re-seating the RAM modules had no effect, so I explored new hypotheses.

I checked for any electrical interference, but there is no correlation between the crash times and the use of heavy electrical machines.

I also checked the correlation between disk access and crashes. The crashes can happen even with low disk activity, but they happen much sooner under disk load. For instance, if I read all the disks in parallel (cat /dev/sdX > /dev/null), I can crash the machine in under an hour. However, SMART data shows nothing wrong. Here is the output of smartctl -a /dev/sdb (the other disks look the same):

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   132   132   054    Pre-fail  Offline      -       112
  3 Spin_Up_Time            0x0007   160   160   024    Pre-fail  Always       -       401 (Average 420)
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   140   140   020    Pre-fail  Offline      -       15
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       7274
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
 22 Helium_Level            0x0023   100   100   025    Pre-fail  Always       -       100
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       260
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       260
194 Temperature_Celsius     0x0002   224   224   000    Old_age   Always       -       29 (Min/Max 10/46)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged
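
A table like the one above can also be checked mechanically rather than by eye. The following is only a minimal sketch: it assumes the column layout shown here, and the set of raw counters it watches (`WATCHED_RAW`) is an illustrative choice, not an official list. The function names are hypothetical.

```python
# Sketch: flag suspicious rows in the attribute table printed by `smartctl -a`.
# Assumes the 10-column layout shown above. WATCHED_RAW is an illustrative
# selection of counters (reallocated, pending, uncorrectable, CRC errors).
WATCHED_RAW = {5, 197, 198, 199}


def parse_smart_attributes(text):
    """Return one dict per attribute row; header and other lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            "raw": fields[9].strip(),
        })
    return rows


def suspicious(rows):
    """Return (name, reason) pairs for rows that deserve a closer look."""
    findings = []
    for row in rows:
        if row["when_failed"] != "-":
            findings.append((row["name"], "WHEN_FAILED is " + row["when_failed"]))
        elif row["thresh"] > 0 and row["value"] <= row["thresh"]:
            findings.append((row["name"], "VALUE at or below THRESH"))
        elif row["id"] in WATCHED_RAW:
            first = row["raw"].split()[0]
            if first.isdigit() and int(first) > 0:
                findings.append((row["name"], "non-zero raw count " + first))
    return findings
```

Fed the table above, `suspicious(parse_smart_attributes(output))` should return an empty list, which matches the reading that SMART reports nothing wrong on this disk.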

So the crashes are somehow related to disks, but I don't know how.
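
The parallel read test described above (one `cat /dev/sdX > /dev/null` per disk) can be sketched as a small script so that it is easy to repeat. This is only an illustration: `drain` and `read_all_parallel` are hypothetical names, and the device paths you pass in are an assumption that depends on the machine.

```python
# Sketch: read several block devices (or files) to the end in parallel,
# discarding the data, like running `cat /dev/sdX > /dev/null` once per disk.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # read 1 MiB at a time


def drain(path):
    """Read `path` sequentially until EOF and return the number of bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as f:
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            total += len(data)
    return total


def read_all_parallel(paths):
    """Drain every path concurrently, one thread each; return {path: bytes_read}."""
    paths = list(paths)
    if not paths:
        return {}
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return dict(zip(paths, pool.map(drain, paths)))
```

For example, `read_all_parallel(["/dev/sdb", "/dev/sdc"])` run as root would read both disks end to end; the device names here are placeholders. The reads are read-only, but on this machine the test is expected to provoke a crash, so it should only be run when that is acceptable.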

csdt
  • What is the result of running https://www.memtest.org/ overnight? – sourcejedi Apr 28 '19 at 12:10
  • @sourcejedi I currently don't have physical access to the machine, but I could get it next week. In the meantime, I can say that I didn't change anything in the RAM configuration during or after the move. – csdt Apr 28 '19 at 14:21
  • How annoying! Did you have to open it as part of reconfiguring it (e.g. changing hard drives?), or are we just talking about a move and maybe different external environment? – sourcejedi Apr 28 '19 at 14:30
  • @sourcejedi Yes, I opened it to install the hard drives and redo some cabling, but I left the RAM modules and the CPU in place. The external environment obviously changed, and the machine is now on an Eaton UPS. I kept the OS, but configured the RAID (first with mdadm and then with ZFS), and installed and configured apache2, transmission-daemon and the UPS driver. – csdt Apr 28 '19 at 15:30
  • Sounds like a hardware issue, and likely a RAM issue. memtest is the way to go for now. – Kiwy Apr 29 '19 at 09:09
  • Another vote for memory. Physically re-seat the modules (i.e. check they're firmly pushed in, and have not slipped due to your machine move) and check with memtest. – Chris Davies Apr 29 '19 at 19:24
  • I just ran the original memtest86 before and after re-seating the RAM modules: no error found. I've started an extensive run of memtest86+ planned to run for roughly 12 hours. – csdt May 04 '19 at 17:59
  • Besides RAM, another frequent cause is overheating, so install lm-sensors if you haven't. I recently had similar problems with bad RAM that wasn't flagged as bad with memtest86+; however, replacing it with new RAM fixed it. So there are cases when memtest86+ doesn't spot bad RAM. – dirkt May 04 '19 at 18:50
  • memtest86 may not be able to detect all errors. Recommended read: The Sig11 problem (which applies to random crashes also). – ckujau May 04 '19 at 19:55
  • I don't have a monitor for CPU temperature, but I can tell you this: during memtest, the CPU is around 40°C, and the crashes happened when the machine was mostly idle. The HDDs and the ACPI thermal zone are around 25°C. However, as you insist on RAM, I remember that XMP is enabled, and the 4 modules are not exactly the same, but are from the same brand and have the exact same timings and frequencies. Plus, before moving, the RAM was the same, and the server ran for half a year with multiple Minecraft servers running (memory hungry). So I don't get why the problem arose after the move. – csdt May 04 '19 at 20:02
  • Reseat CPU and apply new cooling paste. If possible remove half and/or all minus one of RAM modules. Vary the one module across all slots; then one by one add the other modules. Was the CPU ever overclocked? Check BIOS. – Roadowl May 05 '19 at 00:36
  • @Roadowl The CPU cooling is fine; the CPU itself stays under 45°C under heavy load (kernel compilation). It has been overclocked to 4.2 GHz but was thoroughly stress-tested for stability at the time, and I also tried going back to 3.8 GHz, with no effect. I cannot properly test module configurations, as the crashes take such a long time to happen and I don't always have physical access to the machine. – csdt May 05 '19 at 20:41

1 Answer


Looking at the logs, the kernel is tainted, or running in an unsupported state:

Tainted: P IO

A list of taint flags is available in the kernel documentation. The P flag indicates that a proprietary (non-GPL-compatible) module is loaded, and the O flag that an externally built (out-of-tree) module is loaded; most notably, ZFS and its related modules are listed there as such. One of the log snippets you provided shows a general protection fault occurring in the ZFS module, but the rest occur elsewhere in the kernel. Furthermore, GPFs and double faults are raised by the processor itself, which suggests that the modules are probably not at fault here.

What I'm more concerned about is the I taint flag. The I flag means "workaround for bug in platform firmware applied". This points to a potentially serious issue with your system's UEFI/BIOS firmware which could be causing the errors. Did you perform a BIOS update before this began, and was this flag set before you made the hardware upgrades?
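
To make the flag letters easier to read off, here is a minimal sketch that decodes a taint string as printed in the oops header. The descriptions are paraphrased from the kernel's taint documentation for kernels of this generation; the function name `decode_taint` is hypothetical, and newer kernels define additional letters that are not listed here.

```python
# Sketch: decode the letters after "Tainted:" in a kernel oops header.
# Descriptions are paraphrased; only letters known to 4.x-era kernels are listed.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "G": "all loaded modules are GPL-compatible",
    "F": "module was force-loaded",
    "S": "kernel running on an out-of-specification system",
    "R": "module was force-unloaded",
    "M": "processor reported a machine check exception",
    "B": "bad page referenced or unexpected page flags",
    "U": "taint requested by a userspace application",
    "D": "kernel died recently (oops or BUG)",
    "A": "ACPI table overridden by user",
    "W": "kernel issued a warning",
    "C": "staging driver was loaded",
    "I": "workaround for bug in platform firmware applied",
    "O": "externally built (out-of-tree) module was loaded",
    "E": "unsigned module was loaded",
    "L": "soft lockup occurred",
    "K": "kernel has been live patched",
}


def decode_taint(taint):
    """Return (letter, description) pairs for every known flag in the string."""
    return [(ch, TAINT_FLAGS[ch]) for ch in taint if ch in TAINT_FLAGS]


for letter, meaning in decode_taint("P          IO"):
    print(letter, "-", meaning)
```

Run on the string from the logs, this lists P, I and O in that order, with the descriptions given above.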

Unfortunately, the links to your full logs no longer work, so I can't really give more specific assistance. The full logs will likely provide details as to what firmware bug the system is working around, as well as other possible indicators of trouble.
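
Even without the full logs, the register dumps quoted in the question can be inspected. On x86-64 with 48-bit virtual addresses, a pointer is canonical only if bits 63 down to 47 are all equal, and dereferencing a non-canonical pointer raises a general protection fault. The sketch below checks three register values copied from the GPF dumps above. It is an observation under the assumption that these registers were meant to hold kernel pointers, not a diagnosis, and the helper names are hypothetical.

```python
# Sketch: test register values from the oops dumps for non-canonical addresses
# and report whether flipping a single bit would make them canonical again.
# Assumes 48-bit virtual addressing (bits 63..47 must all be equal).

def is_canonical(addr):
    """True if bits 63..47 of a 64-bit address are all zeros or all ones."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


def single_bit_repair(addr):
    """Return the bit (47..63) whose flip makes addr canonical, else None."""
    for bit in range(47, 64):
        if is_canonical(addr ^ (1 << bit)):
            return bit
    return None


# Values copied from the three general protection fault dumps in the question.
REGISTERS = {
    "RAX in zio_create": 0xFBFF9CFF4E756040,
    "RDX in hrtimer_active": 0xFFFD8DBB1FB94C00,
    "RDI in __task_pid_nr_ns": 0xF7FF8C428AAA95C8,
}

for name, value in REGISTERS.items():
    print(name, hex(value),
          "canonical" if is_canonical(value) else "non-canonical",
          "repair bit:", single_bit_repair(value))
```

All three values are non-canonical and each becomes a plausible kernel address when one bit is flipped (bits 58, 49 and 59 respectively). That pattern would be consistent with the hardware suspicion raised in the comments, although it does not by itself say which component is responsible.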

bwDraco