systemd unit service file does not wait for bind9 to be ready

Bug #1899902 reported by Aristarkh Zagorodnikov
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
bind9 (Debian)
Fix Released
Unknown
bind9 (Ubuntu)
Fix Released
Undecided
Athos Ribeiro

Bug Description

The details of this issue are available at Debian Bug report #900788 (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900788).

The named service in Focal is defined as follows:
$ cat /lib/systemd/system/named.service
[Unit]
Description=BIND Domain Name Server
Documentation=man:named(8)
After=network.target
Wants=nss-lookup.target
Before=nss-lookup.target

[Service]
EnvironmentFile=-/etc/default/named
ExecStart=/usr/sbin/named -f $OPTIONS
ExecReload=/usr/sbin/rndc reload
ExecStop=/usr/sbin/rndc stop

[Install]
WantedBy=multi-user.target
Alias=bind9.service

Unfortunately, since bind9 starts up asynchronously, the services that depend on nss-lookup.target may fail DNS resolution due to bind9 not being started up in time. This, in some cases, leads to the crashing of services that resolve domain names on startup, for example, nginx that resolves its proxy upstream servers.

Tags: server-next

Related branches

description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Thank you for the report Aristarkh.
It is odd that this was fixed so long ago - 1:9.11.4+dfsg-1 is what the debian bug states.

That was only pushed to Debian buster in:
 195 bind9 (1:9.11.4+dfsg-1) unstable; urgency=medium
 196
 197 [ Bernhard Schmidt ]
 198 * Use systemd Type=forking to signal daemon init.
 199 Thanks to Elie Roudninski (Closes: #900788)
...
 209 -- Ondřej Surý <email address hidden> Sat, 14 Jul 2018 12:27:56 +0000

But I don't see it on any path to Debian/sid - not in the past changelogs nor in the currently used .service files.

I need to ask if this might have had an intentional reason ... or if it really was "lost again" later on.

Changed in bind9 (Ubuntu):
status: New → Confirmed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

After consulting with Andreas it seems on the switch between 9.11 -> 9.16 a few things got lost in Debian as that was branched earlier - [1] is an effort to bring some back. This might be one more of them.

Subscribing the team to take a deeper look.

[1]: https://salsa.debian.org/dns-team/bind9/-/merge_requests/8

tags: added: server-next
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

https://salsa.debian.org/dns-team/bind9/-/commit/40fd1691a9b07c36573373eb872139b4e836b722 is the missing commit, both in ubuntu and current debian sid

Changed in bind9 (Debian):
status: Unknown → Fix Released
Changed in bind9 (Ubuntu):
assignee: nobody → Athos Ribeiro (athos-ribeiro)
Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote (last edit ):
Download full text (4.3 KiB)

I filed an MP for impish and and MR for Debian at https://salsa.debian.org/dns-team/bind9/-/merge_requests/17 with the missing commit and a regression autopkgtest test case.

I will start working on SRUs after I get reviews on those.

Note that there may be an issue with testing the changes using nginx in impish, which is being addressed here: https://salsa.debian.org/dns-team/bind9/-/merge_requests/17/diffs#146a89a78e4846ec68c44d1e006f142585418863_0_81

The reason I am mentioning nginx in particular is that this is the package that led the reporter to file this bug against bind9. If you are not interested in the resolution for nginx, you can skip the rest of this comment.

The nginx package in impish has an ExecStartPre= directive in it's systemd unit file. While testing the patch above with the provided autopkgtest test case (without the sed command to remove the ExecStartPre= directive) I was facing a situation where the test would succeed in some of the runs, and fail in other runs.

The issue did not happen for focal, which does not contain the ExecStartPre= directive in the nginx unit file. In the failing cases, checking the systemd service status (or logs) for nginx would contain errors on failed configuration file checks (which is what the ExecStartPre= directives does in the impish nginx package), hindering the service start up process:

```
Jun 18 18:37:41 test-bind9-repr-script-impish systemd[1]: Starting A high performance web server and a reverse proxy server...
Jun 18 18:37:41 test-bind9-repr-script-impish nginx[285]: nginx: [emerg] host not found in upstream "bind9test.debian.dev" in /etc/nginx/sites-enabled/default:4
Jun 18 18:37:41 test-bind9-repr-script-impish nginx[285]: nginx: configuration file /etc/nginx/nginx.conf test failed
Jun 18 18:37:41 test-bind9-repr-script-impish systemd[1]: nginx.service: Control process exited, code=exited, status=1/FAILURE
Jun 18 18:37:41 test-bind9-repr-script-impish systemd[1]: nginx.service: Failed with result 'exit-code'.
Jun 18 18:37:41 test-bind9-repr-script-impish systemd[1]: Failed to start A high performance web server and a reverse proxy server.
```

Sometimes, disabling the nginx service, enabling it back, and rebooting the system was the only way I could get the issue to manifest itself (which would still be an intermittent issue).

Here is a short reproducer for the issue I am observing. un-commenting the Wants bit does the trick and get everything to start in the expected order.

```
#!/bin/bash

set -xe

MAIN_UNIT=$HOME/.config/systemd/user/reproducer-main.service
DEPENDENCY_UNIT=$HOME/.config/systemd/user/reproducer-dependency.service
TMP_LOGFILE=$(tempfile)

cat << EOF > $MAIN_UNIT
[Unit]
Description=testing stuff
After=reproducer-dependency.service
# Wants=reproducer-dependency.service

[Service]
Type=forking
ExecStartPre=sh -c 'echo I should be msg 2 >> $TMP_LOGFILE'
ExecStart=sh -c 'echo I should be msg 3 >> $TMP_LOGFILE'

[Install]
WantedBy=default.target
EOF

cat << EOF > $DEPENDENCY_UNIT
[Unit]
Description=This should be triggered BEFORE everything else

[Service]
Type=forking
ExecStart=sh -c 'echo I should be msg 1 >> $TMP_LOGFILE'

[Install]
WantedBy=default.target
E...

Read more...

Revision history for this message
Dan Streetman (ddstreet) wrote :

> un-commenting the Wants bit does the trick and get everything to start in the expected order.

> Is this just the expected systemd behaviour

yes.

After= only causes the service to start after the requested unit(s) when part of a single "transaction", like bootup (i.e. when the transaction is starting multi-user.target or similar).

Calling systemctl start with multiple services doesn't put them into the same transaction, so each is started completely independently of each other.

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Thanks, Dan!

Do you have any idea for the described behavior with nginx then? I could only observe that when the ExecStartPre= is present, and right after reboots.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I think you are seeing races, that's all. The result can go either way. I worry the dep8 test might be flaky.

I initially thought that switching to Type=forking would make things even worse, but apparently that's not the case. Now we even get config errors reported in the "systemctl start bind9", so that's a good change already.

Have you tried crearing a huge zone for bind9, then see how your test behaves? Without removing the ExecStartPre

Revision history for this message
Dan Streetman (ddstreet) wrote :

> Do you have any idea for the described behavior with nginx then

I looked in hirsute, so not sure if things are different upstream, but in that release the named.service is:

[Service]
EnvironmentFile=-/etc/default/named
ExecStart=/usr/sbin/named -f $OPTIONS
ExecReload=/usr/sbin/rndc reload
ExecStop=/usr/sbin/rndc stop
Restart=on-failure

since this defines an ExecStart= but doesn't set the Type=, this service defaults to a 'simple' type service, which means systemd considers it 'started' essentially immediately (specifically, as soon as systemd forks a thread to run the ExecStart=). So anything that is set to After=named.service (or, set that way indirectly, e.g. with After=nss-lookup.target) will be started when the process running the named binary has been started, not when named actually is "up and running".

The named.service should be adjusted to be a different Type=, probably 'forking' though it would depend on the exact behavior of when named actually forks when the -f param isn't used. Essentially, the named.service file should be configured to not be 'running' until named is actually 'running'. See 'man systemd.service' for details on the the Type= options.

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Thanks for the reply, Andreas and Dan.

Dan, that is what the patch in the linked MP is doing (set the type to forking and removing the -f option from the named binary) and that is where I am observing the described behavior (but just in impish, where there is an ExecStartPre for the nginx unit). It does behave as expected when there is not ExecStartPre= in the nginx unit file.

Andreas, I am not sure if I get the races idea on the dep8 tests since I am restarting the system and there is an "After=" directive there... I will work on creating that "huge zone" for bind9 and report the results when I have them.

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Using a long zone file and limiting the resources of the system I was using to test these changes (just limiting lxc CPU cycles) I found a set up where I can constantly hit the issue.

As Andreas pointed out, we are seeing a race condition here.

After forking, bind9 signals that it is ready by returning on the parent process. This leads systemd to mark the service as ready. Finally nginx starts running.

After bind9's parent process exits, bind9 reloads the response policy zones, creating a small gap where a name in the rpz would not be resolved. If the nginx name checks happen during that gap, nginx will fail to start.

I will change the tests to just add a record to the fake name and then we should have a sound test.

Thanks, Andreas and Dan :)

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote :

Although the proposed patch addresses this bug request, there is still a race condition that may result in having a service that should start "After=" nss-lookup.target to fail resolving names during start-up and, consequently, failing to start.

This should be dealt with upstream, and in fact, there is an upstream bug which, if fixed, should help dealing with the race condition: https://gitlab.isc.org/isc-projects/bind9/-/issues/1176

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bind9 - 1:9.16.8-1ubuntu3.2

---------------
bind9 (1:9.16.8-1ubuntu3.2) impish; urgency=medium

  * d/bind9.named.service: use systemd Type=forking to signal daemon init. This
    fixes a regression of #900788 where services whose startup depend on name
    resolutions may fail due to bind9 not being ready (LP: #1899902).

 -- Athos Ribeiro <email address hidden> Fri, 18 Jun 2021 09:24:39 -0300

Changed in bind9 (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Aristarkh Zagorodnikov (onyxmaster) wrote :

Athos, many thanks for taking care of this issue. I hope that BIND developers would pick up the sd_notify() proposal to make the fix for the issue even better.

Revision history for this message
Athos Ribeiro (athos-ribeiro) wrote (last edit ):

I know this one is closed, but as a follow up to Aristarkh's last comment, it seems this is getting some upstream attention, thanks to Ondřej :)

https://gitlab.isc.org/isc-projects/bind9/-/issues/1176

Revision history for this message
Ondřej Surý (ondrej) wrote :

Folks, here: https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/5514

(I haven't tested it yet, but you can follow the progress there...)

It will most probably make it into 9.18.0 (in December 2021).

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.