Yes, I know this question has been asked tens if not hundreds of times before. Still, I went through all similar questions and tried everything listed on them, to no avail.
After compiling some code on my Raspberry Pi 4 Model B running Ubuntu 21.04 (Linux rpi4 5.11.0-1017-raspi #18-Ubuntu SMP PREEMPT Mon Aug 23 07:34:31 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux), ccache got stuck and has been running for almost an hour at 100% CPU. Here's the output of ps -l for the offending process:
$ ps -l -p 7580
F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
1 R 1000 7580 1 99 80 0 - 1725 - pts/2 00:54:10 ccache
I tried to kill it and to kill -9 it. No effect. It doesn't appear to be a zombie; it's running:
$ sudo cat /proc/7580/syscall
running
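The same /proc directory exposes a few more files that help tell a process spinning in kernel code apart from one blocked in a driver. The following is a minimal sketch, not a definitive tool: the function names state_from_stat and inspect_pid are made up for illustration, and the files are assumed to be readable (syscall usually needs root, as in the command above).

```shell
# Extract the one-letter state from the contents of /proc/<pid>/stat.
# The command name is wrapped in parentheses and may itself contain
# spaces or ")", so drop everything up to the last ") " first.
state_from_stat() (
    rest=${1##*') '}
    printf '%s\n' "${rest%% *}"
)

# Print state, kernel wait channel, and current syscall for a PID.
# inspect_pid is a hypothetical helper; it only reads /proc.
inspect_pid() (
    pid=$1
    [ -r "/proc/$pid/stat" ] || { echo "no such process: $pid" >&2; exit 1; }
    echo "state:   $(state_from_stat "$(cat "/proc/$pid/stat")")"
    echo "wchan:   $(cat "/proc/$pid/wchan" 2>/dev/null)"
    echo "syscall: $(cat "/proc/$pid/syscall" 2>/dev/null)"
)
```

After sourcing it in a root shell, it would be called as inspect_pid 7580. On kernels built with stack tracing support, sudo cat /proc/7580/stack may additionally show where in the kernel the process currently is.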
I tried to attach to it with strace and gdb, and both hang.
I tried finding the parent processes (as indicated by ps auxf), killing all of them, and then killing the offending process again. It didn't work.
I ran dmesg and looked at /var/log/syslog, going back to the time I started the process. I found no clues to help me debug it.
Normally I'd just reboot and get on with life, but this is the third reboot of the day (two due to ccache, and another due to a shell script called cpuUsage.sh remotely installed by VS Code), and I suspect this will become the norm from now on. I've had this board for a couple of months, and this hadn't happened until today.
My only reasonable, yet unsubstantiated, hypothesis is that the SD card the board is booting from may be bad, but I have no idea how to diagnose this.
Although I'd love to be told a magic command that kills this process, I'm fairly sure there is no such thing, given all that I've tried until now. My question is: assuming this continues to happen, how can I diagnose this? It's evidently unsustainable to keep rebooting this board multiple times daily, as I suspect I may have to do from now on.
EDIT: following a suggestion in the comments, I tried the following while watching the dmesg output:
$ sudo dd if=/dev/mmcblk0p2 of=/dev/null bs=1M
60648+1 records in
60648+1 records out
63595068928 bytes (64 GB, 59 GiB) copied, 1386,27 s, 45,9 MB/s
I saw this in the dmesg output:
[27430.135999] INFO: task kworker/3:2:12138 blocked for more than 120 seconds.
[27430.136031] Tainted: G C OE 5.11.0-1017-raspi #18-Ubuntu
[27430.136041] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[27430.136050] task:kworker/3:2 state:D stack: 0 pid:12138 ppid: 2 flags:0x00000008
[27430.136067] Workqueue: events_freezable mmc_rescan
[27430.136088] Call trace:
[27430.136092] __switch_to+0xb8/0xe4
[27430.136102] __schedule+0x2bc/0x7dc
[27430.136110] schedule+0x7c/0x110
[27430.136117] __mmc_claim_host+0xc0/0x1f0
[27430.136124] mmc_get_card+0x40/0x50
[27430.136130] mmc_sd_detect+0x2c/0xa0
[27430.136136] mmc_rescan+0xc8/0x314
[27430.136143] process_one_work+0x200/0x4f0
[27430.136151] worker_thread+0x74/0x3c0
[27430.136158] kthread+0x12c/0x140
[27430.136164] ret_from_fork+0x10/0x3c
Given the presence of SD card-related functions in the stack trace, this seems to confirm my suspicion of a bad SD card.
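If this keeps happening, it may help to pull just the hung-task reports out of the kernel log instead of scrolling through it. The sketch below assumes the message format shown in the trace above; hung_tasks is a hypothetical helper name, and it reads log text on standard input.

```shell
# Read kernel log text on stdin and print "task-name pid" for every
# hung-task report of the form:
#   INFO: task <name>:<pid> blocked for more than <n> seconds.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than [0-9][0-9]* seconds.*/\1 \2/p'
}
```

It could be fed from the live log with sudo dmesg | hung_tasks, or from a saved copy with hung_tasks < /var/log/syslog.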
"hypothesis is that the SD card the board is booting from may be bad, but I have no idea how to diagnose this" - if you suspect bad hardware, then try replacing it. Fortunately, SD cards are cheap. Also, see if you can trigger the problem without ccache by repeatedly running cat /dev/sdX > /dev/null (where sdX is the device node for your SD card). Maybe run dmesg -w & before the cat loop, so you can see any kernel error messages. If the machine locks up while you're cloning your current SD card, that's also a pretty good indicator. – cas Sep 25 '21 at 07:02
[...] kill -9 is a kernel bug or a hardware failure. When it's a kernel bug, usually the bug is in some driver and the process is in state D, in a syscall, and sometimes there's a way to unblock it with some indirect action on the driver (e.g. disconnecting a peripheral). A bad SD card could be the reason if the kernel image is corrupted on it; the very first thing to try is to replace the SD card, obviously not by copying from the existing card (since you'd be copying the bad image). – Gilles 'SO- stop being evil' Sep 25 '21 at 07:31
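The repeated read test suggested in the first comment could be sketched roughly as follows. This is an illustration under assumptions, not a definitive procedure: read_passes is a hypothetical helper name, and the device path passed to it depends on the system (the question reads from /dev/mmcblk0p2).

```shell
# Read a device (or file) from start to end a given number of times,
# stopping at the first pass where the read fails.
# Usage: read_passes <device> <passes>
read_passes() (
    dev=$1
    passes=$2
    i=1
    while [ "$i" -le "$passes" ]; do
        if ! cat "$dev" > /dev/null; then
            echo "pass $i: read error on $dev" >&2
            exit 1
        fi
        echo "pass $i: ok"
        i=$((i + 1))
    done
)
```

Run from a root shell, with sudo dmesg -w & started beforehand so that kernel messages appear while the passes run, for example read_passes /dev/mmcblk0p2 5. A pass that never finishes is as informative as one that reports a read error.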