
Sometimes the kernel uses 40-50% of my CPU, causing lag in other programs. When this happens, the output of iotop looks like this:

Total DISK READ :       2.96 K/s | Total DISK WRITE :    1552.86 K/s
Actual DISK READ:       2.96 K/s | Actual DISK WRITE:     103.72 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                                                                      
17631 be/4 root        0.00 B/s    0.00 B/s  0.00 %  6.99 % [kworker/5:2]
15770 be/4 root        0.00 B/s    0.00 B/s  0.00 %  6.96 % [kworker/5:0]
21092 be/4 root        0.00 B/s    0.00 B/s  0.00 %  4.50 % [kworker/7:3]
23201 be/4 root        0.00 B/s    0.00 B/s  0.00 %  4.48 % [kworker/7:2]
19368 be/4 root        0.00 B/s    0.00 B/s  0.00 %  3.07 % [kworker/4:0]
20876 be/4 root        0.00 B/s    0.00 B/s  0.00 %  3.05 % [kworker/4:3]
14505 be/4 fabian      0.00 B/s    2.96 K/s  0.00 %  2.47 % cinnamon --replace
 9172 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.77 % [kworker/1:1]
 4149 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.76 % [kworker/1:3]
14752 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.25 % [kworker/2:2]
15418 be/4 root        0.00 B/s    0.00 B/s  0.00 %  1.24 % [kworker/2:0]
23131 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.73 % [kworker/0:3]
22790 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.73 % [kworker/0:2]
  236 be/3 root        0.00 B/s 1229.84 K/s  0.00 %  0.13 % [jbd2/sda2-8]
 1625 be/4 fabian      2.96 K/s  308.20 K/s  0.00 %  0.12 % firefox-esr [Cache2 I/O]
22086 be/4 fabian      0.00 B/s    0.00 B/s  0.00 %  0.01 % java -Xss1M -Djava.library.path=/home/f~ng --versionType release [Render thread]

(and then a bunch of 0.00% stuff)

This tells me that the CPU usage is caused by a lot of system "IO", which is pretty vague. It's also odd that "total disk write" is so much higher than "actual disk write". Is there something like "not actual disk write, just kidding"?

But more importantly: How can I find out more about what causes this high CPU usage? What has triggered these kworkers to do so much work?
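
I assume the kernel's workqueue tracepoints could show which work functions those kworkers actually execute. This is just a sketch of what I would try, assuming debugfs is mounted at /sys/kernel/debug and I run it as root:

# Enable tracing of work items as they are queued and picked up by kworkers
cd /sys/kernel/debug/tracing
echo 1 > events/workqueue/workqueue_queue_work/enable
echo 1 > events/workqueue/workqueue_execute_start/enable
# Watch live; each line names the function a kworker is running
cat trace_pipe
# Disable the whole workqueue subsystem again when done
echo 0 > events/workqueue/enable

If that showed, say, a filesystem writeback function dominating, I would at least know which subsystem to investigate.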

Of course I also tried simply looking at the task manager, which doesn't list any kworkers, and at top, which does list some but shows their CPU usage as very low. (I only got the hint that kernel CPU usage is high from a CPU usage applet that has a dedicated colour for the kernel.)
Swap is not used at those times, and RAM, SSD usage and upload+download traffic are nothing remarkable.
I have already pretty much ruled out a relation to WLAN, LAN, Bluetooth, a second screen, the graphics card (about 12% used), other programs running in the background and some more things that were suggested in other places (I don't remember all of them) by disabling them. I also don't use any exotic setup like RAID; it's just a laptop running Debian 9.11 (Linux kernel 4.9.0-11-amd64).
Writing to /proc/sys/vm/drop_caches as described in this answer seems to make the periods of high kernel CPU usage shorter, but I'm not sure, and it doesn't tell me the cause. Please tell me if that was a horrible idea for some reason, because I don't know exactly what it does.
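
For reference, what I ran was along these lines (my reading of the linked answer; the 3 means "drop the page cache plus dentries and inodes"):

sync                                        # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes

As far as I understand, this only discards clean, reclaimable caches, so it shouldn't lose any data; it just costs performance while the caches refill.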

Update: Apparently it was not the graphics card that was 12% busy, but the graphics chip integrated into the CPU. My graphics card wasn't working at all and while trying to get it working, I messed up the system. I took that as an opportunity to switch from Debian to Manjaro and from Cinnamon to Mate. Now the graphics card works and I don't have the problem of this question anymore, but I would of course still like to know how a problem like that could be debugged better.

  • @kaylum No. The first answer assumes that I know which kworker is the problem, but that seems to change a lot and quickly in my case. For example, the one on top in iotop was not even in the list of that logging file. And the first command in the second answer just results in grep: /proc/24910/stack: No such file or directory. – Fabian Röling Jan 11 '20 at 08:46
  • The 24910 was an example PID that the user had; you should use, say, 17631 from your list. Something that would be interesting is the source of interrupts: can you get the information in /proc/interrupts twice, 10 seconds apart, and post the biggest changes? – icarus Jan 11 '20 at 10:37 (a sketch of such a diff appears after these comments)
  • @icarus OK, but what does this output tell me? https://pastebin.com/S9RP3ZHR
    [<ffffffff9ac958c6>] process_one_work+0x1b6/0x430
    [<ffffffff9ac95c08>] worker_thread+0xc8/0x490
    [<ffffffff9ac95b40>] process_one_work+0x430/0x430
    [<ffffffff9ac9bbf9>] kthread+0xd9/0xf0
    [<ffffffff9b21e4f1>] __switch_to_asm+0x41/0x70
    [<ffffffff9ac9bb20>] kthread_park+0x60/0x60
    [<ffffffff9b21e577>] ret_from_fork+0x57/0x70
    [<ffffffffffffffff>] 0xffffffffffffffff
    – Fabian Röling Jan 13 '20 at 09:21
  • About the interrupts: https://pastebin.com/ckBZZ6Ma The biggest differences seem to be where the biggest numbers are, 136, 137 and LOC. Again, I have no idea what any of this means. – Fabian Röling Jan 13 '20 at 09:29
  • This answer suggests running "syscall profiling tools to track down specific applications that seem to be causing this overload". That sounds exactly like what I want, but I can only find tools that tell me which parts of the system were called, which is all pretty cryptic, instead of what made those system calls. – Fabian Röling Jan 13 '20 at 13:52
  • It seems to happen surprisingly rarely recently, but it still happens. I've changed nothing about my setup. – Fabian Röling Jan 17 '20 at 15:59
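
The interrupt-count diff icarus asked for can be captured with a one-liner; a minimal sketch, assuming bash (for process substitution) and a 10-second window:

# Snapshot /proc/interrupts twice, 10 seconds apart, and print what changed
diff <(cat /proc/interrupts) <(sleep 10; cat /proc/interrupts)

Only the counters that moved show up in the output. LOC (the local timer interrupt) always climbs, so the interesting entries are device interrupts with unusually large jumps, like the 136 and 137 mentioned above.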
