nfs-server.service needs name resolution and network online

Bug #1918141 reported by Niklas Edmundsson
This bug affects 6 people
Affects             Status        Importance  Assigned to  Milestone
nfs-utils (Ubuntu)  Fix Released  Undecided   Unassigned
Bionic              Confirmed     Undecided   Unassigned
Focal               Confirmed     Undecided   Unassigned
Groovy              Won't Fix     Undecided   Unassigned
Hirsute             Fix Released  Undecided   Unassigned

Bug Description

[ Impact ]

nfs-server.service (part of the nfs-utils package) has insufficient dependencies to start correctly in a setting where the NFS exports list contains DNS hostnames (not in local hosts file) or netgroups served via network (for example sssd).

This issue can negatively impact users that, for example, cannot easily resort to using IP addresses directly in the /etc/exports file, and need to use hostnames instead in order to be able to mount NFS shares over the network.

[ Test Case ]

You can reproduce the issue by following the instructions below. Adjust the Ubuntu release on "images:ubuntu/hirsute" accordingly.

$ lxc launch images:ubuntu/hirsute nfs-utils-bug1918141 --vm
$ lxc shell nfs-utils-bug1918141
# apt update
# apt install -y nfs-kernel-server
# mkdir /testshare
# cat >> /etc/exports << EOF
/testshare ubuntu.com(ro,sync,no_subtree_check)
EOF
# systemctl edit --full systemd-networkd.service

You will have to:

- Comment out the "Wants=" line.
- Remove "network.target" from the "Before=" line.
- Add a new "After=fakenet.service" line.
- Add a new "Wants=fakenet.service" line.
- Add the following line in the "[Service]" section:

  ExecStartPre=/bin/sleep 10

# systemctl edit --force --full fakenet.service

This new service file will contain:

[Unit]
Description=Fake network.target
DefaultDependencies=no
After=systemd-udevd.service network-pre.target systemd-sysusers.service systemd-sysctl.service
Before=network.target multi-user.target shutdown.target
Conflicts=shutdown.target
Wants=network.target

[Service]
Type=oneshot
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

# systemctl enable fakenet.service
# systemctl disable systemd-resolved
# systemctl mask systemd-resolved
# rm /etc/resolv.conf
# cat > /etc/resolv.conf << EOF
nameserver 8.8.8.8
EOF
# reboot

After a few seconds, you can enter the VM again:

$ lxc shell nfs-utils-bug1918141
# sleep 20 && systemctl status nfs-server.service
● nfs-server.service - NFS server and services
     Loaded: loaded (/lib/systemd/system/nfs-server.service; enabled; vendor preset: enabled)
     Active: active (exited) since Tue 2021-03-16 20:51:42 UTC; 39s ago
    Process: 310 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
    Process: 311 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=0/SUCCESS)
   Main PID: 311 (code=exited, status=0/SUCCESS)

Mar 16 20:51:41 nfs-utils-bug1918141 systemd[1]: Starting NFS server and services...
Mar 16 20:51:41 nfs-utils-bug1918141 exportfs[310]: exportfs: Failed to resolve ubuntu.com
Mar 16 20:51:41 nfs-utils-bug1918141 exportfs[310]: exportfs: Failed to resolve ubuntu.com
Mar 16 20:51:42 nfs-utils-bug1918141 systemd[1]: Finished NFS server and services.

As you can see, the service was started but exportfs failed to resolve the hostname.

[Where problems could occur]

This fix has been applied upstream for quite a while now, and is even part of a release, so it has been extensively tested by users and other distributions that have more recent nfs-utils. I cannot easily envision problems with the proposed changes but:

* If the user has manually modified the nfs-server.service file without taking proper precautions (i.e., editing the file directly in /lib/systemd/system/ instead of using "systemctl edit"), then they might experience a conflict when installing a new version of the file, and their modifications will be lost. However, this falls under the "local configuration issue" category, in my opinion.

[ Original Description ]

nfs-server.service has insufficient dependencies to start correctly in a setting where the nfs exports list contains DNS host names (not in local hosts file) or netgroups served via network (for example sssd).

Typical failures listed by systemctl status nfs-server.service are:

Mar 08 14:16:52 server.example.com exportfs[844]: exportfs: Failed to resolve client1.example.com
Mar 08 14:16:52 server.example.com exportfs[844]: exportfs: Failed to resolve client2.example.com

Our workaround is to add the appropriate dependencies in /etc/systemd/system/nfs-server.service.d/dependencies.conf like so:

----------------------8<----------------------------
[Unit]
# nfs-server.service runs exportfs on startup, thus we need to be able to
# do host and netgroup lookups which requires network to be online.
After=network-online.target nss-lookup.target nss-user-lookup.target
----------------------8<----------------------------

While nfs-server.service does depend on network.target, that only means that the network has been configured. On physical hardware it can take significantly longer for the network to come online (8+ seconds for our 10G NICs). Also note that we configure static IPs via systemd-networkd; things might behave differently when using DHCP, network-manager etc. In any case, depending on network.target is almost always wrong, and network-online.target is usually the right one.

nss-lookup.target is needed to ensure that DNS resolution works, and nss-user-lookup.target is the best approximation to ensure that netgroup resolution via sssd or equivalent works. Usually things "just work" even without these dependencies, but to ensure correct startup they should be present.

It should be noted that this seems to once have been fixed in Ubuntu, but has been lost along the way for quite some time. When googling I find for example https://code.launchpad.net/~ubuntu-branches/ubuntu/wily/nfs-utils/wily-201507271018/+merge/265946 that fixes the network.target vs network-online.target dependency, but it has since been lost in the wind it seems.


Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status: New → Confirmed
Niklas Edmundsson (niklas-edmundsson) wrote :

I should note that we have observed this on both 18.04LTS (bionic) and 20.04LTS (focal).

Sergio Durigan Junior (sergiodj) wrote :

Thanks for the report and the great description.

The rationale here makes sense to me, and I agree with updating the packages in order to solve the described issue. However, since we will have to SRU this into Focal and Bionic (and I've just noticed that Hirsute's nfs-server.service file is basically the same as the one shipped in Focal, so we will have to do this there as well), it would be really good to come up with a reproducer that can ideally be run in a VM (lxd, multipass, virt-manager, it doesn't matter).

I did a quick test here and fired up two LXD VMs. I configured nfs-server in one of them, and placed the following in its /etc/exports:

/testshare nfs-utils-bug1918141-focal-2.lxd(rw,sync,no_subtree_check)

I verified that nfs-utils-bug1918141-focal-2.lxd can be resolved and is not in /etc/hosts, and then restarted the VM. It came back up right away, and the nfs-server service had been started successfully.

I fiddled a bit with the systemd services, even created one fake service that just sleeps for a certain amount of time, but I guess that in a VM if network.target is up then you can just use the network right away.

Anyway, if you have any ideas on how to reproduce this, I'm all ears. Otherwise, if it's really something hard to do in a VM, then we might just go ahead with the SRU and explain the situation. Thanks.

Niklas Edmundsson (niklas-edmundsson) wrote :

First off, nfs-server starts but doesn't export the unresolvable hosts. The failures do show in systemctl status nfs-server.

To reproduce in a VM you'll likely need something that causes network.target to be fulfilled but actual network traffic to not be forwarded until a few seconds later when network-online.target is fulfilled. Exactly how to do this varies between setups. Adding a unit that fulfills network.target shouldn't be too hard; delaying the network startup in a meaningful way is the tricky part. In any case, I recommend systemd-analyze plot > boot.svg to get an overview of the boot timings.

If I find time today I can take a stab at figuring something out that emulates the behavior we see on physical hardware. For these machines we see between 8 and 10 seconds for systemd-networkd-wait-online.service to complete, with the corresponding delay in network-online.target fulfillment. I think the record I've seen on physical hardware was 30+ seconds on first boot when doing a major OS upgrade, triggering a more involved firmware download and NIC restart...

Niklas Edmundsson (niklas-edmundsson) wrote :

OK, getting the network delay got a little bit convoluted as you can't clear dependencies with overrides but instead have to copy the unit file and edit it.

I'm no systemd expert so this can probably be improved, but on a focal VM host of ours (KVM/Ganeti) that uses systemd-networkd I needed to do this, if you're using some other network setup scheme you'll need to adapt accordingly:

* cp /lib/systemd/system/systemd-networkd.service /etc/systemd/system/systemd-networkd.service
** Edit /etc/systemd/system/systemd-networkd.service:
*** Comment out Before= and Wants=
*** Add in [Unit] section, Before= without network.target and After=/Wants= fakenet.service:
**** Before=multi-user.target shutdown.target
**** After=fakenet.service
**** Wants=fakenet.service
*** Add in [Service] section:
**** ExecStartPre=/bin/sleep 10
* Create /etc/systemd/system/fakenet.service with content as [1] (bottom of comment)
* systemctl daemon-reload
* systemctl enable fakenet
* It turned out that systemd-resolved also messed with network.target, so I just disabled it
** Edit /etc/resolv.conf to contain usable DNS resolver
** systemctl stop systemd-resolved
** systemctl mask systemd-resolved
** Check /etc/resolv.conf again and verify that you can do DNS lookups

reboot, and then verify with systemd-analyze plot > boot.svg that timings are right, you should have network.target and then 10s or so later network-online.target (I just load boot.svg into a browser and use find to search in the file). If not you have to find the culprit that has a Before=network.target and edit or disable.

You should also find that nfs-server starts a few seconds before the network is available, and systemctl status nfs-server should show exportfs: Failed to resolve errors.
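
The timing check described above can also be scripted instead of read off boot.svg. What follows is only a minimal sketch, assuming a systemd new enough to support "systemctl show --value"; usec_of and gap_ms are hypothetical helper names, not part of any package.

```shell
#!/bin/sh
# usec_of UNIT: print the monotonic timestamp (microseconds since boot)
# at which UNIT last entered the active state.
usec_of() {
    systemctl show -p ActiveEnterTimestampMonotonic --value "$1"
}

# gap_ms START_USEC END_USEC: difference between two timestamps, in milliseconds.
gap_ms() {
    echo $(( ($2 - $1) / 1000 ))
}

# Example, run inside the rebooted VM:
#   gap_ms "$(usec_of network.target)" "$(usec_of network-online.target)"
#   gap_ms "$(usec_of nfs-server.service)" "$(usec_of network-online.target)"
```

With the delay units in place, the first gap would be expected to be roughly the 10 seconds added by ExecStartPre, and a positive second gap would indicate that nfs-server.service became active before network-online.target was reached.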

Hope this helps to QA/reproduce.

[1] fakenet.service:
[Unit]
Description=Fake network.target
DefaultDependencies=no
After=systemd-udevd.service network-pre.target systemd-sysusers.service systemd-sysctl.service
Before=network.target multi-user.target shutdown.target
Conflicts=shutdown.target
Wants=network.target

[Service]
Type=oneshot
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target

Dan Streetman (ddstreet) wrote :

Ugh, the nfs-utils in Debian/Ubuntu is so incredibly old.

The service file(s) should be updated as they already have upstream, e.g.:
http://git.linux-nfs.org/?p=steved/nfs-utils.git;a=blob;f=systemd/nfs-server.service;h=b432f9102d0c50061890b511de1d61069f435991;hb=refs/heads/master

the important part for this bug is the service file needs to include Wants= and After= for network-online.target
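
On releases without a packaged fix, the same ordering can be approximated with a local drop-in. The following is a sketch under stated assumptions rather than the upstream unit file: write_nfs_dropin is a hypothetical helper, the file name dependencies.conf is borrowed from the original description, and the nss-* targets come from the reporter's workaround, not from upstream.

```shell
#!/bin/sh
# write_nfs_dropin [DIR]: write a drop-in that both pulls in and orders
# nfs-server.service after network-online.target.
# DIR defaults to the drop-in directory used in the original description.
write_nfs_dropin() {
    dir=${1:-/etc/systemd/system/nfs-server.service.d}
    mkdir -p "$dir" || return 1
    cat > "$dir/dependencies.conf" << 'EOF'
[Unit]
# exportfs does host and netgroup lookups at startup.
Wants=network-online.target
After=network-online.target nss-lookup.target nss-user-lookup.target
EOF
}

# As root, on the NFS server:
#   write_nfs_dropin
#   systemctl daemon-reload
#   systemctl show -p After -p Wants nfs-server.service
```

Wants= is what pulls network-online.target into the boot transaction; After= alone only orders the units if something else already pulls the target in.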

tags: added: rls-hh-incoming
Sergio Durigan Junior (sergiodj) wrote :

Thanks Niklas, I was able to reproduce the bug here. I'm working on fixing the service file and will post MPs soon.

Changed in nfs-utils (Ubuntu):
assignee: nobody → Sergio Durigan Junior (sergiodj)
description: updated
Changed in nfs-utils (Ubuntu Bionic):
status: New → Confirmed
Changed in nfs-utils (Ubuntu Focal):
status: New → Confirmed
Changed in nfs-utils (Ubuntu Groovy):
status: New → Confirmed
Changed in nfs-utils (Ubuntu Bionic):
assignee: nobody → Sergio Durigan Junior (sergiodj)
Changed in nfs-utils (Ubuntu Groovy):
assignee: nobody → Sergio Durigan Junior (sergiodj)
Changed in nfs-utils (Ubuntu Focal):
assignee: nobody → Sergio Durigan Junior (sergiodj)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nfs-utils - 1:1.3.4-4ubuntu2

---------------
nfs-utils (1:1.3.4-4ubuntu2) hirsute; urgency=medium

  * Depend on network-online.target when starting services. (LP: #1918141)
    - d/p/lp1918141-use-network-online-target-01.patch: Declare a
      Wants=network-online.target on all NFS server services.
    - d/p/lp1918141-use-network-online-target-02.patch: Declare a
      After=network-online.target on all NFS server services.
    Thanks to Niklas Edmundsson for helping with the reproducer.

 -- Sergio Durigan Junior <email address hidden> Mon, 15 Mar 2021 18:26:22 -0400

Changed in nfs-utils (Ubuntu Hirsute):
status: Confirmed → Fix Released
Robie Basak (racb)
tags: added: network-online-ordering
Robie Basak (racb) wrote :

I'm not sure it's *ever* correct for distribution packaging to ship After=network-online.target. One person's "wait for the network to come online" is another person's "my system hangs on boot because I'm booting offline" and yet another person's "one NIC was up but the one through which DNS is available was not so doing this didn't work anyway". See also: https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/, all bugs I've been tagging network-online-ordering, and today I also came across https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=878109 which describes the same issue. This means I'm also questioning the "fix" in Hirsute. This seems to run completely contrary to upstream's recommendations as linked above.

> Our workaround is to add the appropriate dependecies in /etc/systemd/system/nfs-server.service.d/dependencies.conf...

I believe this is the correct thing to do, but rather than a workaround it's simply configuring the service to your local requirements. We cannot predict all differing local requirements in packaging in advance, especially when they can conflict in practice.

In general, should all network-type services in Ubuntu be After=network-online.target, or not? I think this is a question we should answer for the entire distribution at once, rather than pushing piecemeal changes that lead to inconsistency and confusion. So -1 for the SRU until we have consensus, and I suggest reverting the change in Hirsute in the meantime also.

Robie Basak (racb) wrote :

Alternatively, if this is an exceptional case for which the upstream recommendation shouldn't apply, this should be clearly documented.

Christian Ehrhardt  (paelzer) wrote :

Hi Robie - I remember the old discussions and the tag. In this particular case I think it is somewhat special because:

a) This isn't a Ubuntu decision - upstream of the package/functionality goes this way, so if we want to challenge it, we should do so there.

b) This one is also different as it isn't a service that your system "consumes" like root-iscsi or anything like it - there we'd get "my boot hangs". But in this case this exists to "provide" services to others. I know that there are others that have problems, e.g. things that fail to bind on an interface as they refuse to listen to netlink to "bind later once available". But if this ends up as case-by-case decision

c) I remember a few of the old incarnations of this discussion, but it already has become a (use)case-by-case decision throughout various packages and services. Some clearly work best with After=network-online, some are clearly degraded with it - and sadly there are some in between which go either way depending on the surrounding HW/Setup/Use-Case. Have a look at the growing list of packages that have implemented it that way - https://codesearch.debian.net/search?q=After.*network-online&literal=0

Based on that IMHO I think:
- the change for Hirsute is right
- it might be right to challenge it (to be sure about this particular case), but then let us do so at upstream nfs-utils
- I agree that for an SRU the potential regressions might be too much and there users can configure it accordingly

Niklas Edmundsson (niklas-edmundsson) wrote :

For this I honestly see no risks of regressions for nfs-server.

Also, be aware that on systems using DHCP, depending on network.target or network-online.target has the same effect, simply because DHCP packets cannot pass until the interface is able to pass traffic.

When using static addressing it's essential that things that require networking to be able to handle traffic do depend on the correct thing, and that's network-online.target.

Regarding the "my system hangs on boot" comment:

1) nfs-server is a server service, it's not something you run on a laptop or somesuch and expect things to work without networking.
2) Network startup has a timeout, so the system will boot eventually.

Do remember that one of the appealing things about Ubuntu/Debian is the main goal of services to "just work", and the most basic thing here must be to get at least the dependencies right for things to work in the common use case for which the service is intended.

As a final note on nfs-server.service:

It currently does two things: Start the nfs-server and do exportfs. It's the latter that depends on network-online, so the startup could be split into two services with tailored dependencies.

Robie Basak (racb) wrote :

> This isn't a Ubuntu decision - upstream of the package/functionality goes this way, so if challenged one should challenge it there.

I disagree. This is entirely a question of how we integrate between services. The distribution is where that integration happens, and users expect consistency of integration from the distribution. This is the point of packaging. Where there is a conflict, consistency here should be across different packages in the distribution, rather than between individual packages and their upstreams. So we should figure out what best practice should be across our package archive first.

> I remember a few of the old incarnations of this discussion, but it already has become a (use)case-by-case decision throughout various packages and services.

Even more reason to seek to develop a best practice then, rather than continue to exacerbate a problem that might already exist.

If it's decided that network-online.target should normally be provided by default by any package that provides a service that (might be configured/always) depend on a network service, then that's fine, the change in Hirsute will be correct, and we can continue to change packages when this issue is inevitably brought up again. But that should be a deliberate choice we make, not one made on a piecemeal basis only considering one use case at a time. Because consistency across the distribution matters for user expectations and a good user experience.

To be clear, I'm not presupposing the answer here; I just want this properly considered and a consistent decision made that considers the distribution as a whole.

> For this I honestly see no risks of regressions for nfs-server.

As Christian noted, there's a difference here between changing this for the future and changing this in existing stable releases. There's a relatively low downside if we don't change this in existing stable releases. Users aren't blocked: they just need to configure their dependencies according to their local requirements. This is relatively easy and already explained here. The downside of not making the change is: users configuring a service for the first time will have to find an issue and discover the bug. The downside of making the change is: we're changing behaviour for users who've already configured the service, which might break them depending on their specific local setup.

Niklas Edmundsson (niklas-edmundsson) wrote :

For nfs-server and the risk of applying the fix to stable releases, this is my take:

It won't break anything unless you've done something truly esoteric, and then you're on your own anyway IMHO.

It will enable nfs-server to consistently start in configurations using static addressing, which is NOT esoteric or strange IMHO, regardless of the startup time of your NIC/switchport/etc. It will make startup consistent, which is currently not the case if your NIC is just slow enough to sometimes trigger name resolution failures, depending on how lucky you were when rebooting the machine.

The current behavior is unwanted, and while the fix was present in Ubuntu once upon a time (pre-systemd releases and wily I think), it has since gone missing so this is really a regression that's lingered longer than necessary.

Dan Streetman (ddstreet) wrote :

> I'm not sure it's *ever* correct for distribution packaging to ship After=network-online.target

you're correct, it's not. As you pointed out, upstream systemd recommends that applications handle interface configuration changes (e.g. carrier loss/gain) themselves, but few applications actually do that, as it's far easier to simply assume networking is 'already set up' from the application's POV.

> "my system hangs on boot because I'm booting offline"

indeed, and this is something that annoys me frequently, very commonly seen (for me at least) when my network configuration includes an interface that I know isn't currently connected during boot, and yet I have to stare at the boot progress waiting to timeout before it continues.

To clarify that point a bit, systemd-networkd itself doesn't block progress to multi-user.target; it only informs systemd-networkd-wait-online, which is what network-online.target waits for. There's a similar service for network-manager. And also, it's entirely possible to configure networkd to *not* block boot for any particular interface, using the RequiredForOnline= parameter (I'm not sure about anything equivalent for network-manager).
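
As an illustration of the RequiredForOnline= setting mentioned above, the sketch below writes a drop-in for an existing .network file so that its interface no longer holds up systemd-networkd-wait-online. It rests on assumptions: write_optional_dropin is a hypothetical helper, 10-netplan-eth1.network is a made-up file name, and the systemd version in use must support drop-in directories for .network files.

```shell
#!/bin/sh
# write_optional_dropin NETWORK_FILE [BASEDIR]: mark the interface configured
# by NETWORK_FILE (e.g. 10-netplan-eth1.network, a hypothetical name) as not
# required for network-online.target.
write_optional_dropin() {
    netfile=$1
    base=${2:-/etc/systemd/network}
    [ -n "$netfile" ] || return 1
    mkdir -p "$base/$netfile.d" || return 1
    cat > "$base/$netfile.d/optional.conf" << 'EOF'
[Link]
RequiredForOnline=no
EOF
}

# As root:
#   write_optional_dropin 10-netplan-eth1.network
#   systemctl restart systemd-networkd
```

A drop-in is used here instead of a new .network file because networkd applies only the first matching .network file to an interface, so a separate file would replace the existing configuration instead of extending it.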

> This isn't a Ubuntu decision - upstream of the package/functionality goes this way

and this hits the core of the problem - few upstream applications really want to add the extra complexity of being able to dynamically handle networking. Which is generally fair, but depending on the exact application and the exact way it uses networking, it can sometimes be a problem.

> For this I honestly see no risks of regressions for nfs-server.

The only way this could possibly cause a regression is if nfs-server was the *only* (enabled) service on a user's system that included Wants=network-online.target. If *any* other service Wants=network-online.target (and *any* service also has After=network-online.target), then boot will wait for network-online.target, and it doesn't matter at all *how many* services want it, the timeout waiting for it is the same.

Note that cloud-config.service and cloud-final.service both depend on network-online.target, so *all* cloud images will block boot until network-online.target completes.

In a default Ubuntu install, there are various services that will block boot waiting for networking, most obviously whoopsie.service, which I believe is always installed by default, so again *all* Ubuntu installs will wait for network-online.target.

> Even more reason to seek to develop a best practice then, rather than continue to exacerbate a problem that might already exist.

The best practice would likely be to spend time working with individual applications so they could handle dynamic changes in networking better, but even then it needs to be a case-by-case basis for applications depending on network-online at boot. Additionally, unless we can remove the need for network-online.target from *all* services enabled on any particular machine, it won't help at all - only a single service pulling in network-online.target makes all other work pointless. Of course, simply working on getting all installed-by-default services to not require network-online.target...


Andreas Hasenack (ahasenack) wrote :

Guys, this has been sitting in unapproved for about 2 months now, and I need to push another fix for nfs-utils asap (see https://bugs.launchpad.net/bugs/1918141). I only noticed this unapproved upload when I was about to push the git tag, as the upload doesn't show up in rmadison's output.

I would prefer to not include this fix here in my upload, as it looks like there is no consensus and it's a bigger behavior change, and my is a pure bugfix with upstream patches that are already applied in redhat, debian, and focal+.

Andreas Hasenack (ahasenack) wrote :

... and mine* is a pure bugfix ...

Andreas Hasenack (ahasenack) wrote :
Chris Halse Rogers (raof) wrote : Proposed package upload rejected

An upload of nfs-utils to focal-proposed has been rejected from the upload queue for the following reason: "There is still no consensus that this is the right approach, or is SRUable even if it becomes policy. Rejecting for now, to let other bugfix in. Once we've resolved the policy discussion, please re-upload (if that's the way policy goes!)".

Brian Murray (brian-murray) wrote :

I've also rejected this for Groovy.

Brian Murray (brian-murray) wrote :

The Groovy Gorilla has reached end of life, so this bug will not be fixed for that release

Changed in nfs-utils (Ubuntu Groovy):
status: Confirmed → Won't Fix
Changed in nfs-utils (Ubuntu):
assignee: Sergio Durigan Junior (sergiodj) → nobody
Changed in nfs-utils (Ubuntu Bionic):
assignee: Sergio Durigan Junior (sergiodj) → nobody
Changed in nfs-utils (Ubuntu Focal):
assignee: Sergio Durigan Junior (sergiodj) → nobody
Changed in nfs-utils (Ubuntu Groovy):
assignee: Sergio Durigan Junior (sergiodj) → nobody
Changed in nfs-utils (Ubuntu Hirsute):
assignee: Sergio Durigan Junior (sergiodj) → nobody
latimerio (fomember) wrote :

I had the same problem and tried to solve it with the nfs-server.service unit, but did not succeed.
So I finally ended up with a workaround and put a script into crontab, like

    #crontab -e
    @reboot /usr/local/bin/exportNFS

and

    #cat /usr/local/bin/exportNFS
    sleep 5
    /usr/sbin/exportfs -r
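
A fixed "sleep 5" may be too short on hardware like that in the original report, where the network took 8+ seconds to come online. One possible variant polls name resolution before re-exporting. This is a minimal sketch and not a tested replacement: wait_for_host and RESOLVE_CMD are hypothetical names, client1.example.com stands in for a hostname from /etc/exports, and getent is assumed to be available.

```shell
#!/bin/sh
# wait_for_host HOST [TRIES] [DELAY]: retry a lookup of HOST until it
# succeeds or TRIES attempts have failed, sleeping DELAY seconds between
# attempts. RESOLVE_CMD defaults to "getent hosts", which resolves through
# NSS as configured in nsswitch.conf.
wait_for_host() {
    host=$1
    tries=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# In /usr/local/bin/exportNFS, instead of the fixed sleep:
#   wait_for_host client1.example.com 60 1 && /usr/sbin/exportfs -r
```

This only checks one representative hostname; exports that mix several DNS names or netgroups might need a check per name.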
