MAAS

named.conf.options.inside.maas reverts to default

Bug #1888536 reported by Ian Marsh on 2020-07-22

This bug affects 2 people

Affects  Status        Importance  Assigned to      Milestone
MAAS     Fix Released  High        Björn Tillenius  MAAS 2.9.0b4
2.8      Fix Released  High        Björn Tillenius  MAAS 2.8.2rc1

Bug Description

Default /var/snap/maas/current/bind/named.conf.options.inside.maas looks like this:

-----
dnssec-validation auto;

allow-query { any; };
allow-recursion { trusted; };
allow-query-cache { trusted; };
-----

After changing the upstream_dns config setting to our upstream DNS, and turning off dnssec_validation (our upstream DNS is a bit broken in that regard), it changes to this:

-----
forwarders {
x.x.x.x;
};

dnssec-validation no;

allow-query { any; };
allow-recursion { trusted; };
allow-query-cache { trusted; };
-----

At this point, DNS works fine (in this case, we have a Juju-deployed OpenStack, and all machines/containers can resolve).

You wait... time passes. Services start timing out, and you discover DNS no longer works in your machines/containers.

Looking at /var/snap/maas/current/bind/named.conf.options.inside.maas, it has reverted to the default settings. Checking the config (with the maas CLI) shows the settings are still correct.
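
Until a fixed release is installed, the revert described above could be detected by comparing the generated file against the values that were configured. The following is only a sketch: the function name find_drift and the forwarder address 192.0.2.53 (a documentation-range address standing in for the elided x.x.x.x) are hypothetical, and the parsing assumes the file keeps the simple layout shown in this report.

```python
import re

# Path taken from this report; it applies to snap-based installs.
OPTIONS_PATH = "/var/snap/maas/current/bind/named.conf.options.inside.maas"


def find_drift(config_text, expected_forwarders, expected_dnssec):
    """Return a list of mismatches; an empty list means the file looks right."""
    problems = []

    # Look for a line of the form: dnssec-validation <value>;
    match = re.search(r"dnssec-validation\s+(\w+)\s*;", config_text)
    actual_dnssec = match.group(1) if match else None
    if actual_dnssec != expected_dnssec:
        problems.append(
            f"dnssec-validation is {actual_dnssec!r}, expected {expected_dnssec!r}"
        )

    # Look for a block of the form: forwarders { a.b.c.d; ... };
    block = re.search(r"forwarders\s*\{([^}]*)\}", config_text)
    actual_forwarders = set()
    if block:
        actual_forwarders = {
            item.strip() for item in block.group(1).split(";") if item.strip()
        }
    missing = set(expected_forwarders) - actual_forwarders
    if missing:
        problems.append(f"missing forwarders: {sorted(missing)}")

    return problems
```

A periodic job could read OPTIONS_PATH, pass its contents to find_drift together with the expected forwarders and the expected dnssec-validation value, and raise an alert whenever the returned list is not empty.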

A quick fix is possible by changing the config (we set dnssec_validation to "yes" and then back to "no") which regenerates the named.conf.options.inside.maas correctly and all is well again. However, this shouldn't be considered a workaround, as the random service-affecting outages are not acceptable to our users, despite the quick fix.

We have observed this on:
2.8.1-8567-g.c4825ca06
2.7.1-8261-g.5143564e6

We haven't seen this behaviour on our pre-snap system (2.4.2-7034-g2f5deb8b8).

In case it's relevant, on our 2.8.1/2.7.1 systems we're running dual region/rack controllers for redundancy, so we're also using an external postgres. Our 2.4.2 is a single region/rack controller.

I'm hoping this is reproducible elsewhere. Downloading logs from the affected systems is difficult, and I don't currently have access to them. If my logs are necessary, I will add them when I can.

Related branches

~bjornt/maas:bug-1888536-2.8

Merged into maas:2.8

MAAS Lander: Approve on 2020-08-03

Björn Tillenius: Approve on 2020-08-03

~bjornt/maas:dns-crash-issues

Merged into maas:master

Alberto Donato (community): Approve on 2020-08-03

Ian Marsh (drulgaard) wrote on 2020-07-27:

I've reproduced this on a test system to which I have full access.

At the time DNS starts failing, /var/snap/maas/common/log/named.log shows:

25-Jul-2020 09:44:34.333 ../../../lib/dns/rbtdb.c:1499: fatal error:
25-Jul-2020 09:44:34.333 RUNTIME_CHECK(rbtdb->next_serial != 0) failed
25-Jul-2020 09:44:34.333 exiting (due to fatal error in library)
25-Jul-2020 09:44:38.443 starting BIND 9.11.3-1ubuntu1.12-Ubuntu (Extended Support Version) <id:a375815>

So BIND crashes and is restarted - with a bad configuration.
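
The crash-then-restart pattern in the log excerpt above could be picked out automatically. This is a rough sketch that assumes every relevant line starts with a timestamp in the format shown in the excerpt and that the two message prefixes match the ones quoted there; the function names are hypothetical.

```python
from datetime import datetime

# Timestamp layout as it appears in the named.log excerpt,
# for example: 25-Jul-2020 09:44:34.333
TS_FORMAT = "%d-%b-%Y %H:%M:%S.%f"


def parse_line(line):
    """Split a log line into (timestamp, message), or return None if it does not parse."""
    parts = line.split(" ", 2)
    if len(parts) < 3:
        return None
    try:
        ts = datetime.strptime(parts[0] + " " + parts[1], TS_FORMAT)
    except ValueError:
        return None
    return ts, parts[2].strip()


def find_crash_restarts(lines):
    """Return (crash_time, restart_time) pairs for each fatal exit followed by a start."""
    events = []
    crash_time = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, message = parsed
        if message.startswith("exiting (due to fatal error"):
            crash_time = ts
        elif message.startswith("starting BIND") and crash_time is not None:
            events.append((crash_time, ts))
            crash_time = None
    return events
```

Running find_crash_restarts over the lines of /var/snap/maas/common/log/named.log gives the times at which named crashed and came back, which can then be compared with the time DNS resolution started failing.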

I think the crash is actually due to changing the dnssec-validation option and reloading, and restarting instead of reloading prevents this crash. This means I should use a different 'quick fix'!

However, I can't find anything in the logs as to why the configuration file was changed, which is the real issue here.

Björn Tillenius (bjornt) wrote on 2020-07-27:

Thanks for your bug report.

Yes, this looks like it's this issue upstream:

https://gitlab.isc.org/isc-projects/bind9/-/issues/784

I still don't know why the configuration files are being reverted. I'm still looking into it.

Reading the issue above, you might want to do a 'snap restart maas' to ensure that the config is right, and then do a second 'snap restart maas'. That should restart bind with the correct config, and that might work around the issue.

Changed in maas:
status:	New → Triaged
importance:	Undecided → High

Björn Tillenius (bjornt) wrote on 2020-07-28:

It's trivial to reproduce this by setting dnssec-validation to "no" and then doing a 'kill -9' on the named process that MAAS started. You'll then see the same behavior: named.conf.options.inside.maas reverts to the default values.
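
The reproduction above could be written down as a list of commands. This sketch only builds the argument lists and executes nothing. The function name, the profile name and the PID are hypothetical, and the 'maas <profile> maas set-config' form is assumed to be the CLI equivalent of changing the dnssec_validation setting. Finding the PID of the named process that MAAS started is left to the reader, since other named processes may be running on the same host.

```python
def reproduction_commands(profile, named_pids):
    """Build the argv lists for the reproduction; nothing is executed here."""
    commands = [
        # Assumed CLI form for turning DNSSEC validation off.
        ["maas", profile, "maas", "set-config",
         "name=dnssec_validation", "value=no"],
    ]
    # Simulate the BIND crash by force-killing each given named process.
    for pid in named_pids:
        commands.append(["kill", "-9", str(pid)])
    return commands
```

Each returned list can be passed to subprocess.run on a disposable test system. Once MAAS has restarted named, named.conf.options.inside.maas can be inspected to see whether it reverted to the defaults.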

Björn Tillenius (bjornt) wrote on 2020-08-03:

We can't easily do anything about the upstream issue, but I'll fix MAAS so that it handles bind crashing a bit better, since we seem to have a bug there. That should prevent MAAS from putting the default config files in place, so bind should start to work as before after MAAS automatically restarts it.

Changed in maas:
status:	Triaged → In Progress
assignee:	nobody → Björn Tillenius (bjornt)

MAAS Lander (maas-lander) on 2020-08-03

Changed in maas:
milestone:	none → next
status:	In Progress → Fix Committed

Ole Kleinschmidt (oklhost) wrote on 2020-08-06:

Sounds great, thanks a lot! So there should no longer be any need for a cron job that runs 'snap restart maas'?

Adam Collard (adam-collard) on 2020-10-05

Changed in maas:
milestone:	next → 2.9.0b4

Lee Trager (ltrager) on 2020-10-16

Changed in maas:
status:	Fix Committed → Fix Released

Duplicates of this bug

Bug #1889042

Remote bug watches

auto-gitlab.isc.org-isc-projects-bind9-- #784
[closed]