MAAS

consider adding startsecs in supervisor "programs" configuration

Bug #1871582 reported by Junien F on 2020-04-08

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Fix Released	Medium	Alberto Donato	MAAS 2.8.0b1
	2.7	Fix Released	Medium	Alberto Donato	MAAS 2.7.1b1

Bug Description

Hi,

Using MAAS snap 2.7.0-8235-g.fea3a1678. I'm mostly looking at the bind9 supervisor program/service/unit, you name it.

This supervisor service launches the shell script "run-named". This shell script copies a few files from the snap to $SNAP_DATA, then runs "$SNAP/bin/maas-rack setup-dns" and finally starts named.

By default, the supervisor units have a "startsecs" of 1s, so 1s after supervisor starts "run-named", it considers the service to be "RUNNING", as per the following log from supervisor-run.log :

INFO success: bind9 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

However, my system runs only MAAS (and its database) and has 20 2.40GHz cores, 32GB RAM, and hardware RAID1 on spinning rust, yet "maas-rack setup-dns" still takes approximately 2s to run (https://pastebin.ubuntu.com/p/8YzG5km4qt/ - this is run in "sudo snap run --shell maas.supervisor" by the way).

What this means is that there's at least a 1s period during which the "bind9" supervisor service is "RUNNING" but without named actually running.
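To make the arithmetic explicit, here is a minimal sketch of the window during which supervisor reports RUNNING while named is not yet up. It is a simplified model: the function name and parameters are hypothetical, and it assumes named becomes reachable the moment the setup steps finish (it ignores the file copy and named's own startup time, so the real window is likely larger).

```python
def unready_window(startsecs: float, setup_seconds: float) -> float:
    """Seconds during which supervisor says RUNNING but named is not started.

    startsecs: supervisor's startsecs for the program.
    setup_seconds: time run-named spends before it actually launches named
                   (an assumed, measured value; about 2s in this report).
    """
    # Supervisor flips to RUNNING after `startsecs`; named only starts after
    # `setup_seconds`. If setup finishes first, there is no window at all.
    return max(0.0, float(setup_seconds) - float(startsecs))


# Default startsecs of 1s with a 2s setup leaves a 1s window.
default_window = unready_window(startsecs=1, setup_seconds=2)

# With startsecs raised to 10s, the window disappears in this model.
raised_window = unready_window(startsecs=10, setup_seconds=2)
```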

Now MAAS' service_monitor comes into play. It considers (for good reasons?) that named is running as soon as the supervisor service is marked as RUNNING. MAAS (not sure which part exactly) then tries to query named or reload it via rndc, but since named may not be running yet, rndc can get a "connection refused". maas.log shows:

maas.dns: [error] Reloading BIND failed (is it running?): Command `rndc -c /var/snap/maas/5229/bind/rndc.conf.maas reload` returned non-zero exit status 1:#012rndc: connect failed: 127.0.0.1#954: connection refused

Then MAAS tries to send a SIGKILL to terminate named (see bug 1710278 for context about that), and logs :

maas.service_monitor: [error] Service 'bind9' failed to kill:

which I think happens because it can't find the PID of named, since it's not running. regiond logs show https://pastebin.ubuntu.com/p/W8dxb6vHpZ/

A workaround that comes to mind immediately is to increase startsecs for the bind9 service. I set it to 10s, and it fixed this issue. However, choosing an appropriate value for startsecs may be complicated.
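For illustration, the workaround amounts to adding a "startsecs" key to the bind9 program section of the supervisor configuration. The sketch below does that with Python's configparser on a made-up fragment: the "command" path is a placeholder, and the real configuration shipped in the MAAS snap may look different. RawConfigParser is used so that any "%(...)s" expansions in a supervisor config would not be interpreted by Python.

```python
import configparser
import io

# Hypothetical fragment; the command path is a placeholder, not the real one.
SAMPLE = """\
[program:bind9]
command=/path/to/run-named
"""


def set_startsecs(config_text: str, program: str, seconds: int) -> str:
    """Return config_text with startsecs set for the given supervisor program."""
    parser = configparser.RawConfigParser()
    parser.read_string(config_text)
    # Supervisor program sections are named "program:<name>".
    parser.set(f"program:{program}", "startsecs", str(seconds))
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# 10s is the value that worked on the reporter's system, not a recommendation.
UPDATED = set_startsecs(SAMPLE, "bind9", 10)
```

The result keeps the original "command" line and gains "startsecs=10" in the same section. Note that this only widens the margin; it does not guarantee named is up when the timer expires.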

Other possibilities include: checking whether named is actually running before trying to query it; having a supervisor service that _just_ runs named and running "maas-rack setup-dns" somewhere else; or dropping supervisor altogether and using snap services (https://snapcraft.io/docs/service-management).
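As a rough sketch of the first option, a caller could poll the rndc control port until it accepts TCP connections before attempting a reload. This is not how MAAS implements it; the function name, timeout, and polling interval are assumptions, and only the address 127.0.0.1 and port 954 come from the log line above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0,
                  interval: float = 0.2) -> bool:
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection could be made within roughly `timeout`
    seconds. A successful connect only shows that something is listening,
    not that it is ready to serve every request.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Covers "connection refused" as well as connect timeouts.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A caller would then run the rndc reload only if wait_for_port("127.0.0.1", 954) returns True, and treat False as "named is not up yet" rather than as a reason to kill the service.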

Related branches

~ack/maas:1871423-supervisord-backoff-2.7

Merged into maas:2.7

Alberto Donato (community): Approve on 2020-04-08

~ack/maas:1871423-supervisord-backoff

Merged into maas:master

Björn Tillenius: Approve on 2020-04-08

MAAS Lander: Approve on 2020-04-08

Alberto Donato (ack) on 2020-04-08

Changed in maas:
assignee:	nobody → Alberto Donato (ack)
status:	New → In Progress
importance:	Undecided → Medium
milestone:	none → 2.8.0b1

MAAS Lander (maas-lander) on 2020-04-08

Changed in maas:
status:	In Progress → Fix Committed

Alberto Donato (ack) on 2020-04-17

Changed in maas:
status:	Fix Committed → Fix Released
