named.conf.options.inside.maas reverts to default

Bug #1888536 reported by Ian Marsh
This bug affects 2 people
Affects   Status         Importance   Assigned to       Milestone
MAAS      Fix Released   High         Björn Tillenius
2.8       Fix Released   High         Björn Tillenius

Bug Description

Default /var/snap/maas/current/bind/named.conf.options.inside.maas looks like this:

-----
dnssec-validation auto;

allow-query { any; };
allow-recursion { trusted; };
allow-query-cache { trusted; };
-----

After changing the upstream_dns config setting to our upstream DNS, and turning off dnssec_validation (our upstream DNS is a bit broken in that regard), it changes to this:

-----
forwarders {
    x.x.x.x;
};

dnssec-validation no;

allow-query { any; };
allow-recursion { trusted; };
allow-query-cache { trusted; };
-----

At this point, DNS works fine (in this case, we have a Juju-deployed OpenStack, and all machines/containers can resolve).

You wait... time passes. Services start timing out, and you discover DNS no longer works in your machines/containers.

Looking at /var/snap/maas/current/bind/named.conf.options.inside.maas, it has reverted to the default settings. Checking the config (with the maas CLI) shows the settings are still correct.

A quick fix is possible by changing the config (we set dnssec_validation to "yes" and then back to "no") which regenerates the named.conf.options.inside.maas correctly and all is well again. However, this shouldn't be considered a workaround, as the random service-affecting outages are not acceptable to our users, despite the quick fix.
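To catch the revert before services start timing out, the generated file can be compared with the values that were configured. The following is only a minimal sketch, not part of MAAS: the function names `parse_options` and `has_reverted` are hypothetical, the regular expressions only cover the simple layout shown above, and the expected forwarder and dnssec-validation value have to be supplied by the caller.

```python
import re


def parse_options(text):
    """Return (forwarders, dnssec_validation) found in an options fragment.

    Only handles the simple layout shown in this report; this is not a
    general named.conf parser.
    """
    forwarders = []
    match = re.search(r"forwarders\s*\{([^}]*)\}", text)
    if match:
        # Entries inside the braces are separated by semicolons.
        forwarders = [
            item.strip() for item in match.group(1).split(";") if item.strip()
        ]
    dnssec = None
    match = re.search(r"dnssec-validation\s+(\w+)\s*;", text)
    if match:
        dnssec = match.group(1)
    return forwarders, dnssec


def has_reverted(text, expected_forwarders, expected_dnssec):
    """True if the fragment no longer matches the configured settings."""
    forwarders, dnssec = parse_options(text)
    return (
        sorted(forwarders) != sorted(expected_forwarders)
        or dnssec != expected_dnssec
    )
```

In use, one would read /var/snap/maas/current/bind/named.conf.options.inside.maas, pass its contents to `has_reverted` together with the upstream DNS address and "no", and raise an alert when it returns True.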

We have observed this on:
2.8.1-8567-g.c4825ca06
2.7.1-8261-g.5143564e6

We haven't seen this behaviour on our pre-snap system (2.4.2-7034-g2f5deb8b8).

In case it's relevant, on our 2.8.1/2.7.1 systems we're running dual region/rack controllers for redundancy, so we're also using an external postgres. Our 2.4.2 is a single region/rack controller.

I'm hoping this is reproducible elsewhere. Downloading logs from the affected systems is difficult, and I don't currently have access to them. If my logs are necessary, I will add them when I can.

Ian Marsh (drulgaard) wrote :

I've reproduced this on a test system to which I have full access.

At the time DNS starts failing, /var/snap/maas/common/log/named.log shows:

25-Jul-2020 09:44:34.333 ../../../lib/dns/rbtdb.c:1499: fatal error:
25-Jul-2020 09:44:34.333 RUNTIME_CHECK(rbtdb->next_serial != 0) failed
25-Jul-2020 09:44:34.333 exiting (due to fatal error in library)
25-Jul-2020 09:44:38.443 starting BIND 9.11.3-1ubuntu1.12-Ubuntu (Extended Support Version) <id:a375815>

So BIND crashes and is restarted - with a bad configuration.
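The crash-then-restart pattern in the log excerpt above can be picked out mechanically. The following is a rough sketch that assumes the line layout shown in that excerpt (date, time, then message); the function name `find_crash_restarts` is hypothetical, and timestamps are kept as plain strings rather than parsed.

```python
def find_crash_restarts(lines):
    """Pair each fatal-error exit with the next BIND start that follows it.

    Returns a list of (crash_timestamp, restart_timestamp) string tuples.
    """
    events = []
    pending_crash = None
    for line in lines:
        # Expected layout: "<date> <time> <message>"
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        stamp = parts[0] + " " + parts[1]
        message = parts[2]
        if message.startswith("exiting (due to fatal error"):
            pending_crash = stamp
        elif message.startswith("starting BIND") and pending_crash is not None:
            events.append((pending_crash, stamp))
            pending_crash = None
    return events
```

Feeding it the lines of /var/snap/maas/common/log/named.log gives the times at which named went down and came back, which can then be compared with the modification time of named.conf.options.inside.maas.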

I think the crash is actually due to changing the dnssec-validation option and reloading, and restarting instead of reloading prevents this crash. This means I should use a different 'quick fix'!

However, I can't find anything in the logs as to why the configuration file was changed, which is the real issue here.

Björn Tillenius (bjornt) wrote :

Thanks for your bug report.

Yes, this looks like it's this issue upstream:

  https://gitlab.isc.org/isc-projects/bind9/-/issues/784

I still don't know why the configuration files are being reverted. I'm still looking into it.

Reading the issue above, you might want to do a 'snap restart maas' to ensure that the config is right, and then do a second 'snap restart maas'. That should restart bind with the correct config, and that might work around the issue.

Changed in maas:
status: New → Triaged
importance: Undecided → High
Björn Tillenius (bjornt) wrote :

It's trivial to reproduce this by setting dnssec-validation to "no" and then doing a 'kill -9' on the named process that maas started. You'll then see the same behavior: named.conf.options.inside.maas has reverted to the default values.

Björn Tillenius (bjornt) wrote :

We can't easily do anything about the upstream issue, but I'll fix MAAS so that it handles bind crashing a bit better, since we seem to have a bug there. That should prevent MAAS from putting the default config files in place, so bind should start to work as before after MAAS automatically restarts it.

Changed in maas:
status: Triaged → In Progress
assignee: nobody → Björn Tillenius (bjornt)
Changed in maas:
milestone: none → next
status: In Progress → Fix Committed
Ole Kleinschmidt (oklhost) wrote :

Sounds great, thanks a lot! So there should be no more need for a cron job running 'snap restart maas'?

Changed in maas:
milestone: next → 2.9.0b4
Lee Trager (ltrager)
Changed in maas:
status: Fix Committed → Fix Released