107

Note: I wrote an article on Medium that explains how to create a service, and how to avoid this particular issue: Creating a Linux service with systemd.

Original question:


I'm using systemd to keep a worker script working at all times:

[Unit]
Description=My worker
After=mysqld.service

[Service]
Type=simple
Restart=always
ExecStart=/path/to/script

[Install]
WantedBy=multi-user.target

Although the restart works fine if the script exits normally after a few minutes, I've noticed that if it repeatedly fails to execute on startup, systemd will just give up trying to start it:

Jun 14 11:10:31 localhost systemd[1]: test.service: Main process exited, code=exited, status=1/FAILURE
Jun 14 11:10:31 localhost systemd[1]: test.service: Unit entered failed state.
Jun 14 11:10:31 localhost systemd[1]: test.service: Failed with result 'exit-code'.
Jun 14 11:10:31 localhost systemd[1]: test.service: Service hold-off time over, scheduling restart.
Jun 14 11:10:31 localhost systemd[1]: test.service: Start request repeated too quickly.
Jun 14 11:10:31 localhost systemd[1]: Failed to start My worker.
Jun 14 11:10:31 localhost systemd[1]: test.service: Unit entered failed state.
Jun 14 11:10:31 localhost systemd[1]: test.service: Failed with result 'start-limit'.

Similarly, if my worker script fails several times with an exit status of 255, systemd gives up trying to restart it:

Jun 14 11:25:51 localhost systemd[1]: test.service: Failed with result 'exit-code'.  
Jun 14 11:25:51 localhost systemd[1]: test.service: Service hold-off time over, scheduling restart.  
Jun 14 11:25:51 localhost systemd[1]: test.service: Start request repeated too quickly.  
Jun 14 11:25:51 localhost systemd[1]: Failed to start My worker.  
Jun 14 11:25:51 localhost systemd[1]: test.service: Unit entered failed state.  
Jun 14 11:25:51 localhost systemd[1]: test.service: Failed with result 'start-limit'.

Is there a way to force systemd to always retry after a few seconds?

BenMorel
  • 4,587

5 Answers

105

I would like to extend Rahul's answer a bit.

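systemd allows a limited number of start attempts (StartLimitBurst) and stops trying once that count is reached within StartLimitIntervalSec. Both options belong to the [Unit] section (on systemd 230 and later; older releases used StartLimitInterval= in the [Service] section instead).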

The default delay between executions is 100ms (RestartSec) which causes the rate limit to be reached very fast.

Once the limit is reached, systemd makes no further automatic restart attempts for units that have a Restart= policy defined. From the documentation:

Note that units which are configured for Restart= and which reach the start limit are not attempted to be restarted anymore; however, they may still be restarted manually at a later point, from which point on, the restart logic is again activated.

Rahul's answer helps, because the longer delay prevents reaching the error counter within the StartLimitIntervalSec time. The correct answer is to set both RestartSec and StartLimitBurst to reasonable values though.
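
To make the "set both" advice concrete, here is a minimal sketch of a drop-in that states the limits explicitly and spaces restarts far enough apart that a crash loop alone does not hit them. The unit name test.service comes from the log in the question; the file name restart.conf and the values are assumptions chosen for illustration, not recommendations. It assumes systemd 230 or later, where the StartLimit* options live in [Unit]. The script writes into the current directory so the result can be inspected before it is copied to /etc/systemd/system/test.service.d/.

```shell
#!/bin/sh
# Sketch only: generate a drop-in for test.service in the current directory.
# Assumptions: unit name "test.service", file name "restart.conf",
# and the values below are illustrative.
dropin_dir=./test.service.d

mkdir -p "$dropin_dir"
cat > "$dropin_dir/restart.conf" <<'EOF'
[Unit]
# Allow at most 5 starts per 10 second window (the stock defaults, made explicit).
StartLimitIntervalSec=10
StartLimitBurst=5

[Service]
Restart=always
# 5 restarts spaced 3 s apart take longer than the 10 s window.
RestartSec=3
EOF

echo "Wrote $dropin_dir/restart.conf"
```

After copying the directory into /etc/systemd/system/, run systemctl daemon-reload and restart the service so the new values take effect.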

Will S
  • 103
MarSik
  • 1,166
  • 13
    Now that I (finally) understand how it works, after some trial-and-error, I can see that your answer is the most correct. Bottom line for me: set StartLimitIntervalSec=0 and voilà. – BenMorel Aug 26 '17 at 15:42
  • You don't actually need StartLimitBurst. You can leave it out. You only need what @BenMorel said and StartLimitIntervalSec to limit CPU load when it retries to start your script. – Binar Web Mar 09 '21 at 15:02
  • Unknown lvalue 'StartLimitIntervalSec' in section 'Unit' – geoidesic Jul 19 '23 at 14:44
70

Yes, there is. You can tell systemd to wait a given number of seconds before each restart by setting RestartSec in the [Service] section:

[Service]
Type=simple
Restart=always
RestartSec=3
ExecStart=/path/to/script

After saving the file you need to reload the daemon configurations to ensure systemd is aware of the new file,

systemctl daemon-reload

then restart the service to apply the changes,

systemctl restart test

As you requested: looking at the documentation,

Restart=on-failure

sounds like a decent recommendation.

Rahul
  • 13,589
  • It seems to work indeed, thank you! So to understand this better, without a RestartSec directive, systemd attempts several restarts very quickly, then enters a permanent failure state; something that cannot happen when RestartSec is specified? – BenMorel Jun 14 '16 at 09:48
  • Also, I've noticed that it delays the "normal" restart of my worker (I'm purposefully exiting the worker gracefully after a few minutes); is there a way to only delay a failed restart? – BenMorel Jun 14 '16 at 09:50
  • @Benjamin see my updates – Rahul Jun 14 '16 at 09:53
  • @Benjamin you can check here for more parameters. – Rahul Jun 14 '16 at 09:57
  • 5
    Judging by the doc, always is a superset of on-failure, so it won't help! – BenMorel Jun 14 '16 at 10:10
  • No, I don't see any provision to delay a failed restart – Rahul Jun 14 '16 at 10:37
  • To finally answer myself: it's not RestartSec alone that prevents the start limit from kicking in, it's a combination of RestartSec, StartLimitIntervalSec and StartLimitBurst. See @MarSik's answer for a more thorough explanation. – BenMorel Aug 26 '17 at 15:43
  • Service has more than one ExecStart= setting, which is only allowed for Type=oneshot services. Refusing. – Vlad Dec 15 '19 at 09:00
  • systemd[1]: Unknown section 'Service'. Ignoring. – Time Killer Dec 03 '22 at 22:48
9

systemd gives up trying to restart it

No. systemd gives up trying to restart it for a little while. This is clearly shown in the log that you supply:

Jun 14 11:25:51 localhost systemd[1]: test.service: Failed with result 'start-limit'.

This is rate limiting kicking in.

The length of the little while is specified in the service unit, using the StartLimitIntervalSec= setting. The number of starts needed within that interval to trigger the rate limiting mechanism is specified via the StartLimitBurst= setting. If nothing on your system differs from vanilla systemd, including the defaults for these two settings, then it is 5 times within 10 seconds.

StartLimitIntervalSec=0 disables rate limiting, so systemd will retry forever rather than giving up. But making your service either not exit so often, or idle enough between exits and restarts that it does not exceed the rate limiting threshold, is a better approach.

Note that rate limiting does not care how your service exited. It triggers on the number of attempts to start/restart it, irrespective of their cause.
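
The interplay between the restart delay and the defaults quoted above (5 starts within 10 seconds) can be checked with a little arithmetic. The sketch below uses a deliberately rough model: it assumes the service fails immediately, so starts are spaced exactly RestartSec apart, and it ignores manual starts. Under that model, the start that would exceed the burst comes burst × RestartSec after the first one, so the limit can only trigger if that moment still falls inside the interval. The function name limit_reachable is made up for this illustration.

```shell
#!/bin/sh
# Rough model, not systemd's actual implementation.
# Arguments: restart delay (ms), StartLimitBurst, StartLimitIntervalSec (ms).
# Succeeds if a crash loop alone can exceed the start limit.
limit_reachable() {
    restart_ms=$1
    burst=$2
    interval_ms=$3
    [ $(( restart_ms * burst )) -le "$interval_ms" ]
}

# Compare the default 100 ms delay with RestartSec=3, using the default limits.
for delay in 100 3000; do
    if limit_reachable "$delay" 5 10000; then
        echo "RestartSec=${delay}ms: start limit can be reached"
    else
        echo "RestartSec=${delay}ms: start limit cannot be reached by restarts alone"
    fi
done
```

With the default 100 ms delay the limit is reachable (500 ms is well inside 10 s), while a 3 second delay pushes the sixth start to 15 s, outside the window. A script that runs for a while before failing only widens that margin.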

Raedwald
  • 144
JdeBP
  • 68,745
  • 10
    It does seem to give up permanently, though: "Active: failed (Result: start-limit) since Wed 2016-06-15 01:21:24 CEST; 12h ago". It stays in this state and the script is never executed again. I tried setting manually StartLimitIntervalSec=10 and StartLimitIntervalSec=5, no luck. – BenMorel Jun 15 '16 at 11:55
  • 10
    It does give up permanently by default. See https://github.com/systemd/systemd/issues/2416. – Adam Goode Apr 22 '17 at 17:10
  • 6
    Bottom line: to prevent it from giving up permanently, set StartLimitIntervalSec=0. – BenMorel Aug 26 '17 at 15:43
2

Add a StartLimitIntervalSec=0 directive to the [Unit] block of the service file.

Also you can add RestartSec=3 to the [Service] section to change the retry delay to 3 seconds.

[Unit]
Description=My worker
After=mysqld.service
# Added:
StartLimitIntervalSec=0

[Service]
Type=simple
Restart=always
# Added:
RestartSec=3
ExecStart=/path/to/script

[Install]
WantedBy=multi-user.target

(source)

Finesse
  • 121
1

If your service is not restarting after a reboot, please ensure you have enabled it beforehand:

sudo systemctl enable your.service