4

Recently, I encountered a postfix mail server being unresponsive due to high load, i.e. a high CPU usage.

That means console and ssh logins didn't complete due to the load.

Monitoring showed 100 % CPU usage, while I/O still being moderate.

After a hard reset the last log lines from the system journal read:

Feb 06 09:37:16 example.org postfix/master[10461]: warning: unix_trigger_event: read timeout for service private/tlsmgr
Feb 06 10:04:58 example.org kernel: systemd-journal invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-250
[..]
Feb 06 10:04:59 example.org kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/dnf-makecache.service,task=dnf,pid=1584309,uid=0
Feb 06 10:04:59 example.org kernel: Out of memory: Killed process 1584309 (dnf) total-vm:944600kB, anon-rss:817340kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:1772kB oom_score_adj:0
[..]
Feb 06 10:04:56 example.org dovecot[10353]: master: Warning: Time moved forwards by 5.880984 seconds - adjusting timeouts.
[..]
Feb 06 09:41:36 example.org systemd-networkd[10273]: enp0s3: Failed
[^--- sic, yes, shows up out-of-order in journalctl output]
Feb 06 10:04:56 example.org postfix/master[10461]: warning: unix_trigger_event: read timeout for service public/pickup
[..]
Feb 06 10:04:56 example.org dovecot[10353]: master: Error: service(imap-login): Initial status notification not received in 30 seconds, killing the process
Feb 06 10:04:56 example.org dovecot[10362]: imap-login: Fatal: master: service(imap-login): child 1584458 killed with signal 9
Feb 06 10:04:56 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Broken pipe
[.. many repeats of above message ..]
Feb 06 10:05:03 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Broken pipe
Feb 06 10:05:03 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Application error
[.. many repeats of above message ..]
Feb 06 11:22:01 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Application error
Feb 06 11:22:01 example.org postfix/tlsmgr[10953]: warning: end-of-input while reading request from tlsmgr socket: Application error
Feb 06 11:22:27 example.org systemd-journald[10176]: [] Suppressed 116313 messages from postfix.service
[^--- last message]

(Fedora 37, postfix-3.6.4-1.fc36.x86_64)

It looks like a similar issue reported a year ago to the postfix mailinglist that unfortunately didn't receive any replies.

It also looks like the initial issue is the system being short on memory such that the OOM killer is invoked. The OOM killer doesn't really help with the system being unresponsive, such that different processes run into edge cases (e.g. systemd-networkd seems to take down the interface due to an timeout, however, the system was still reachable via ping).

Apparently, because of all the OOM mess, the tlsmgr process goes into an infinite tight loop that also floods the system journal which also doesn't help with recovering system responsiveness ..

So is there a way to let postfix behave more gracefully/robustly in such a situation?

And how can I protect the server against such a denial of service?


See also:

maxschlepzig
  • 57,532
  • I personally think that is something to fix in postfix, as such logs quickly fill up the log storage. – dma_k Dec 25 '23 at 00:35

1 Answers1

1

The root cause is the system running out of memory and thus, to make the system more robust, one should decrease the probability of that event. Especially by:

  • limiting the memory usage of core services, e.g. by limiting the number of concurrent database connections in the - say - Golang DB connection pool, limiting attachment sizes in the mail server, setting application memory/cache limits, etc.
  • disabling superfluous services
  • possibly adding more RAM to that system

In addition, one might work on improving the failure mode under near OOM conditions and (relatively) low memory situation of involved components, such as:

  • improving dnf (or microdnf, the slated dnf successor) such that it is usable on systems with less than 2 GiB RAM - and systems which have less than 1 GiB free memory available for such a house-keeping utility, which should consume much much less memory than 800 MiB for routine tasks
  • improving systemd-networkd such that it doesn't take down network interfaces in OOM situations
  • improving Postfix such that tlsmgr and its clients recover quickly from an near OOM situation, e.g. by giving up on outdated epoll events instead of flooding and hogging the system with superfluous error messages and processing
maxschlepzig
  • 57,532