
This is quite related to this question, but since it does not really have any satisfactory answers I figured I could ask a new question.

This screenshot shows htop indicating one core with 100% utilization, but with no process using any large amount of cpu:

htop with one core at 100% and process much lower

I assume this means that the kernel itself is using this much CPU for some unknown reason, but I haven't found a good way of investigating this. (I am looking into using eBPF for this now.) I thought it might have something to do with my disk encryption and disk access, but iotop does not show any significant disk usage. I am running Arch Linux with a completely standard kernel.

The problem has appeared a couple of times lately and always goes away if I reboot, and always takes at least a couple of hours of on-time to appear.

Any ideas and suggestions for how to debug this or what the underlying cause could be would be very welcome.

Edit:

So this new screenshot shows htop set to display both kernel and user threads, but there is still no clear explanation for the high cpu usage:

htop with kernel and user threads

Edit 2:

The following screenshot shows results from bpftrace when running bpftrace -e 'profile:hz:99 /cpu == 0/ { @[kstack] = count(); }'. It seems that the kernel is spending a lot of time in acpi_os_execute_deferred for some reason.


  • Set htop to show all threads, including kernel threads (which it hides by default). Then you'll see what's causing it, because it's either a process (whether in user or kernel mode at the time) or a true kernel thread (like a kworker). Nothing else would cause a high CPU usage to be reported there. – forest Jun 07 '21 at 22:42
  • @forest so I did this with Shift-H and Shift-K (see new screenshot), but there don't seem to be any user or kernel threads that use any significant amount of CPU. – NatureShade Jun 07 '21 at 23:03
  • Oh, that is odd. The only thing I can think of now is something like a bug in system management mode, but I doubt that's it. One thing I'd do: try using cpuset to isolate your cores and prevent processes from being scheduled to CPU 0. Does it still use 100% in that situation? – forest Jun 07 '21 at 23:14
  • @forest so I tried this with sudo cset shield --cpu=0 (I might have misunderstood how to use this command), but I got the message cset: **> 1 tasks are not movable, impossible to move. cset set -l lists a couple of tasks in the root set using CPUs 0-7, and the core is still at 100%. – NatureShade Jun 07 '21 at 23:46
  • There's a kernel boot parameter which you can set in your bootloader which restricts the CPUs which is more effective than doing it at runtime since tasks don't need to be migrated. Since you say this happens even after you reboot (after some time), is it always the same CPU or does it vary? – forest Jun 07 '21 at 23:49
  • I could do that, but it does not seem very deterministic. It usually takes about 6-7 hours before the problem appears, and not on every boot (so it is not exactly the end of the world, but it would be nice to find a cause). I haven't really been keeping track, but maybe 2-3 times a week (I boot about once a day). I believe it is always CPU 0, but I am not entirely sure. I am going to try to get bpftrace to work for a little while (to do a kernel stack trace), but if I can't I will give up for today and have another look next time the problem appears. – NatureShade Jun 08 '21 at 00:16
  • You may be able to use ftrace for this more easily than using bpf. – forest Jun 08 '21 at 00:18
  • Newest edit shows some results from bpftrace, but they don't make much sense to me. Most time seems to be spent going into an idle state and a bunch is spent on ACPI functions. Why does this cause 100% usage, and why is so much time spent in the ACPI functions? And since this seems to be in a kernel thread, should it not be seen by htop? – NatureShade Jun 08 '21 at 00:40
  • Can you find out what kernel thread it's in, even if htop won't show it? I admit, my perf testing skills are very poor, and my knowledge of ACPI is even lower. The only thing I know of in ACPI that differs based on hardware is AML (ACPI Machine Language). I'm not sure if it would even be relevant though, since I don't think it's executed very often (only on certain hardware events). – forest Jun 08 '21 at 00:41
  • Yes, with bpftrace -e 'kprobe:acpi_ps_parse_aml /cpu == 0/ { printf("%d\n", tid); }' I get that the kernel thread ID is 124638. I didn't seem to find anything when searching for it in htop, though. Correction: that was wrong, it seems to be kworker/0:3-kacpi_notify, but htop says it uses 0% CPU. – NatureShade Jun 08 '21 at 00:48
  • @forest thank you for all the help! I finally found the problem now, or at least the cause of the high CPU usage. The fact that htop reported no usage for this thread continues to be a mystery. Could this be a bug in the kernel? – NatureShade Jun 08 '21 at 01:16
  • Have you set htop to show detailed usage information, including soft and hard IRQs? Also from acpi_ps_parse_aml it looks like my earlier comment was right, it was AML. I guess it was getting executed anew each time the interrupt fired, and the hardware was spamming your CPU with interrupts. – forest Jun 08 '21 at 01:17
  • Is your kernel built with CONFIG_IRQ_TIME_ACCOUNTING set? Without it, interrupt time will not be accounted by the kernel. If ACPI is spamming interrupts, then without that option set, the kernel isn't tracking the time so it has nothing to give to htop to report to you. – forest Jun 08 '21 at 01:24
  • zcat /proc/config.gz | grep CONFIG_IRQ_TIME_ACCOUNTING returns CONFIG_IRQ_TIME_ACCOUNTING=y so it would seem so. Don't know how to set htop to show more detailed information, couldn't find anything about this in the man page. – NatureShade Jun 08 '21 at 01:38
  • F2 is setup. Scroll to "Display options". You can set it to show kernel threads and show more detailed CPU time. – forest Jun 08 '21 at 01:40
  • Most of the CPU bar becomes yellow, it seems, indicating the time is spent in an IRQ, but the kernel thread itself still doesn't show any significant usage unfortunately. – NatureShade Jun 08 '21 at 01:48

1 Answer


Finally found the answer. It turns out that the issue is the same as this question, with additional information here and here. None of these mention the problem with htop showing zero usage though, so that might be an unrelated problem.

As explained in the links above, the fix was to run sudo grep . -r /sys/firmware/acpi/interrupts/ to list the ACPI interrupt counters, and then run echo "disable" > /sys/firmware/acpi/interrupts/gpe6D as root to disable the problematic interrupt (the one with the highest count, in my case gpe6D).
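Picking the noisiest GPE out of the grep output by eye can be tedious, so here is a minimal sketch of that step. The function name busiest_gpe and its optional directory argument are made up for illustration, and the sketch assumes that each gpeXX file begins with its counter as the first whitespace-separated field. Aggregate entries such as gpe_all and sci are deliberately skipped by the glob.

```shell
# Hypothetical helper: print "<count> <name>" for the GPE with the highest count.
# The directory argument defaults to the sysfs path used above; it only exists
# so the function can be tried against a copy of the files.
busiest_gpe() {
    dir="${1:-/sys/firmware/acpi/interrupts}"
    for f in "$dir"/gpe[0-9A-F][0-9A-F]; do
        [ -f "$f" ] || continue
        # Assumption: the first field of each file is the interrupt counter.
        read -r count _ < "$f" || true
        printf '%s %s\n' "${count:-0}" "$(basename "$f")"
    done | sort -rn | head -n 1
}
```

Calling busiest_gpe with no argument reads the live counters. Writing "disable" to the reported file needs root, and a plain sudo echo ... > file does not work because the redirection is done by the unprivileged shell; something like echo disable | sudo tee /sys/firmware/acpi/interrupts/gpe6D is one way around that.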

To figure out that this was the problem, I used bpftrace as explained in the question to take kernel stack traces and see where the CPU was spending its time. I then ran bpftrace -e 'kprobe:acpi_ps_parse_aml /cpu == 0/ { printf("%d\n", tid); }' to find the ID of the kernel thread calling one of the listed functions. The offending thread turned out to be kworker/0:3-kacpi_notify, and some googling showed that others have had similar problems with this kernel thread.
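Going from the thread ID printed by bpftrace to a thread name can be done through /proc, without relying on htop. A minimal sketch, assuming a Linux /proc filesystem (the function name thread_name is hypothetical):

```shell
# Hypothetical helper: print the name of the task with the given thread ID.
# Kernel worker threads have a /proc/<tid>/comm entry like any other task.
thread_name() {
    cat "/proc/$1/comm"
}
```

Pass it the ID that the kprobe one-liner printed, for example thread_name 124638. Kernel worker IDs change between boots, so the number has to be taken from a fresh bpftrace run.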