Recently, I encountered a postfix mail server being unresponsive due to high load, i.e. a high CPU usage.
That means console and ssh logins didn't complete due to the load.
Monitoring showed 100 % CPU usage, while I/O still being moderate.
After a hard reset the last log lines from the system journal read:
Feb 06 09:37:16 example.org postfix/master[10461]: warning: unix_trigger_event: read timeout for service private/tlsmgr
Feb 06 10:04:58 example.org kernel: systemd-journal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-250
[..]
Feb 06 10:04:59 example.org kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/dnf-makecache.service,task=dnf,pid=1584309,uid=0
Feb 06 10:04:59 example.org kernel: Out of memory: Killed process 1584309 (dnf) total-vm:944600kB, anon-rss:817340kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1772kB oom_score_adj:0
[..]
Feb 06 10:04:56 example.org dovecot[10353]: master: Warning: Time moved forwards by 5.880984 seconds - adjusting timeouts.
[..]
Feb 06 09:41:36 example.org systemd-networkd[10273]: enp0s3: Failed
[^--- sic, yes, shows up out-of-order in journalctl output]
Feb 06 10:04:56 example.org postfix/master[10461]: warning: unix_trigger_event: read timeout for service public/pickup
[..]
Feb 06 10:04:56 example.org dovecot[10353]: master: Error: service(imap-login): Initial status notification not received in 30 seconds, killing the process
Feb 06 10:04:56 example.org dovecot[10362]: imap-login: Fatal: master: service(imap-login): child 1584458 killed with signal 9
Feb 06 10:04:56 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Broken pipe
[.. many repeats of above message ..]
Feb 06 10:05:03 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Broken pipe
Feb 06 10:05:03 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Application error
[.. many repeats of above message ..]
Feb 06 11:22:01 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Application error
Feb 06 11:22:01 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Application error
Feb 06 11:22:27 example.org systemd-journald[10176]: [] Suppressed 116313 messages from postfix.service
[^--- last message]
(Fedora 37, postfix-3.6.4-1.fc36.x86_64)
It looks like a similar issue reported a year ago to the postfix mailinglist that unfortunately didn't receive any replies.
It also looks like the initial issue is the system being short on memory such that the OOM killer is invoked. The OOM killer doesn't really help with the system being unresponsive, such that different processes run into edge cases (e.g. systemd-networkd seems to take down the interface due to an timeout, however, the system was still reachable via ping).
Apparently, because of all the OOM mess, the tlsmgr process goes into an infinite tight loop that also floods the system journal which also doesn't help with recovering system responsiveness ..
So is there a way to let postfix behave more gracefully/robustly in such a situation?
And how can I protect the server against such a denial of service?
See also:
- Fedora's dnf having a memory usage issues such that since Fedora 36 it is basically unusable on Cloud instances with < 2 GiB RAM and small single board computers: Bug 1907030 - "dnf update" runs out of memory on swapless machines with 1G or less of RAM (reported 2020, still active, open and unresolved in January, 2023)
- related Fedora devel mailing list thread: dnf makecache memory usage increase (July, 2022)