consider adding startsecs in supervisor "programs" configuration

Bug #1871582 reported by Junien F
Affects  Status        Importance  Assigned to     Milestone
MAAS     Fix Released  Medium      Alberto Donato
2.7      Fix Released  Medium      Alberto Donato

Bug Description

Hi,

Using MAAS snap 2.7.0-8235-g.fea3a1678. I'm mostly looking at the bind9 supervisor program/service/unit, you name it.

This supervisor service launches the shell script "run-named". This shell script copies a few files from the snap to $SNAP_DATA, then runs "$SNAP/bin/maas-rack setup-dns" and finally starts named.

By default, the supervisor units have a "startsecs" of 1s, so 1s after supervisor starts "run-named", it considers the service to be "RUNNING", as shown by the following log line from supervisor-run.log:

INFO success: bind9 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

However, my system, which runs only MAAS (and its database) and has 20 2.40GHz cores, 32GB RAM, and hardware RAID1 on spinning rust, takes approximately 2s to run "maas-rack setup-dns" (https://pastebin.ubuntu.com/p/8YzG5km4qt/ - this is run in "sudo snap run --shell maas.supervisor", by the way).

What this means is that there's at least a 1s period during which the "bind9" supervisor service is "RUNNING" but without named actually running.
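The arithmetic behind that window can be sketched as follows. This is a simplified model, not MAAS code: the function name is hypothetical, and it assumes the only delay before named starts is the "maas-rack setup-dns" step, ignoring the file copies and named's own startup time (so the real window is at least this long).

```python
def false_running_window(setup_seconds: float, startsecs: float) -> float:
    """Seconds during which supervisor reports RUNNING but named has not started.

    Hypothetical helper: supervisor marks the program RUNNING once the
    wrapper script has survived for `startsecs`, while named only starts
    after the setup step has finished.
    """
    return max(0.0, setup_seconds - startsecs)


# Values from this report: setup takes about 2s, default startsecs is 1s.
print(false_running_window(2.0, 1.0))   # 1.0
# With the workaround value of 10s the window closes.
print(false_running_window(2.0, 10.0))  # 0.0
```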

Now MAAS' service_monitor comes into play. It considers (for good reasons?) that named is running as soon as the supervisor service is marked as RUNNING. MAAS (not sure which part exactly) then tries to query named or reload it via rndc, but since named may not be running yet, rndc can get a "connection refused" - maas.log shows:

maas.dns: [error] Reloading BIND failed (is it running?): Command `rndc -c /var/snap/maas/5229/bind/rndc.conf.maas reload` returned non-zero exit status 1:#012rndc: connect failed: 127.0.0.1#954: connection refused

Then MAAS tries to send a SIGKILL to terminate named (see bug 1710278 for context about that), and logs:

maas.service_monitor: [error] Service 'bind9' failed to kill:

which I think happens because it can't find the PID of named, since it's not running. regiond logs show https://pastebin.ubuntu.com/p/W8dxb6vHpZ/

A workaround that comes to mind immediately is to increase startsecs for the bind9 service. I set it to 10s, and it fixed this issue. However, choosing an appropriate value for startsecs may be complicated.
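As an illustration of that workaround, here is a minimal sketch that sets "startsecs" in a supervisor "[program:x]" section using Python's configparser. The sample config text and the command path in it are assumptions made up for the example; how the MAAS snap actually generates its supervisor configuration is not covered here, so treat this as a demonstration of the setting rather than the fix that was shipped.

```python
import configparser
import io


def set_startsecs(conf_text: str, program: str, seconds: int) -> str:
    """Return conf_text with startsecs set for the given supervisor program."""
    # interpolation=None so supervisor-style %(...)s values are left untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    # Raises KeyError if the program section does not exist.
    parser[f"program:{program}"]["startsecs"] = str(seconds)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical config: the command path is a placeholder, not the real one.
SAMPLE = "[program:bind9]\ncommand=/hypothetical/path/run-named\n"

print(set_startsecs(SAMPLE, "bind9", 10))
```

With startsecs at 10, supervisor would only report bind9 as RUNNING after the wrapper script has stayed up for 10 seconds, which on the system described above is well past the end of the setup step.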

Other possibilities include: check if named is actually running before trying to query it, have a supervisor service that _just_ runs named and somehow run "maas-rack setup-dns" somewhere else, or drop supervisor altogether and use snap services (https://snapcraft.io/docs/service-management).
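The first of those options could look roughly like the sketch below: probe the rndc control port before calling rndc, and treat a refused connection as "not up yet" instead of as a failure. The function name is hypothetical, and the address 127.0.0.1 port 954 is taken from the maas.log line quoted above; this is not how MAAS' service_monitor is implemented.

```python
import socket


def control_port_open(host: str = "127.0.0.1", port: int = 954,
                      timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError,
        # or a timeout) when nothing is listening or reachable.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A caller could poll control_port_open() a few times with a short sleep before running "rndc reload", and only escalate to restarting the service if the port never opens.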

Alberto Donato (ack)
Changed in maas:
assignee: nobody → Alberto Donato (ack)
status: New → In Progress
importance: Undecided → Medium
milestone: none → 2.8.0b1
Changed in maas:
status: In Progress → Fix Committed
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released