boot process hangs very often when NFS shares are used

Bug #1233610 reported by Zygmunt Krynicki
This bug affects 1 person
Affects            Status        Importance  Assigned to  Milestone
mountall (Ubuntu)  Fix Released  Undecided   Unassigned
Precise            Won't Fix     Medium      Unassigned

Bug Description

I'm running up-to-date Ubuntu 12.04.3 with the 3.2 kernel. I have a FreeNAS box exporting a number of NFS shares. My machine boots successfully once in a while, but on practically every boot I hit problems that I can only resolve by rebooting and trying again.

My fstab:

# /etc/fstab: static file system information.
proc /proc proc nodev,noexec,nosuid 0 0
UUID=2f6ca502-9419-4040-a702-2c9dc716dbc5 / ext4 errors=remount-ro 0 1
nodev /ramdisk tmpfs defaults 0 0
silverbox:/mnt/vol1/software /nas/software nfs auto 0 0 # ENROLL
silverbox:/mnt/vol1/videos /nas/videos nfs auto 0 0 # ENROLL
silverbox:/mnt/vol4/backup /nas/backup nfs auto 0 0 # ENROLL
silverbox:/mnt/vol4/home /home nfs auto,exec 0 0 # ENROLL
silverbox:/mnt/vol4/source /home/zyga/source nfs auto 0 0 # ENROLL
silverbox:/mnt/vol4/steam /nas/steam nfs auto,bootwait 0 0 # ENROLL
silverbox:/mnt/vol4/music /nas/music nfs auto,bootwait 0 0 # ENROLL
silverbox:/mnt/vol4/photos /nas/photos nfs auto,bootwait 0 0 # ENROLL

The network between the two boxes works perfectly over a gigabit wired connection. I can always mount each share explicitly; the failures only happen at boot. My local network uses OpenWrt routers and has a correct DNS setup for each machine.

My desktop (the machine affected by this bug) uses Network Manager with a DHCP connection, but I did try a static IP before and it had no effect on the failure rate.

I've added a way to open an emergency tty (patched /etc/init/tty6.conf to start on startup) and inspected mountall logs (patched mountall.conf to have --debug, not have --verbose, have console log and not 'expect daemon'). I'll attach /var/log/mountall.log from a successful boot below.

I have tried to debug this issue with jodh and xnox on #ubuntu-devel and got asked to report this and wait for slangasek. I can freely reproduce this bug and I can assist in debugging if required.

Zygmunt Krynicki (zyga) wrote :
Steve Langasek (vorlon) wrote :

A log from a successful boot isn't going to tell me much. I need to see a log from a *failed* boot. :-)

How are your network devices configured on this client? ifupdown, network-manager?

Changed in mountall (Ubuntu):
status: New → Incomplete
Zygmunt Krynicki (zyga) wrote :

Adding the mountall log from a failed boot. I also have the mount output from that very same moment, which shows that everything is mounted correctly, though.

Zygmunt Krynicki (zyga) wrote :

Adding mount output from a FAILED boot, note that everything is actually mounted. Please correlate this to the failed mountall.log which thinks that not all of the network filesystems have been mounted.
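
One way to do that correlation by hand is to compare the expected NFS mount points with what the kernel reports. The sketch below is only an illustration: it assumes the kernel's mount table has been saved as text in /proc/mounts format, the helper name `unmounted_points` is made up for this example, and the sample data is modelled on the fstab above rather than taken from the attached logs.

```python
def unmounted_points(expected, proc_mounts_text):
    """Return the expected mount points that do not appear in the mount table."""
    mounted = set()
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts lines look like: device mountpoint fstype options dump pass
        if len(fields) >= 2:
            mounted.add(fields[1])
    return [point for point in expected if point not in mounted]


# Sample data modelled on the fstab in this report (not a real capture).
sample = """\
silverbox:/mnt/vol4/home /home nfs rw,relatime 0 0
silverbox:/mnt/vol4/steam /nas/steam nfs rw,relatime 0 0
"""
print(unmounted_points(["/home", "/nas/steam", "/nas/music"], sample))
```

An empty result would mean that every expected share is mounted, which is the situation described for the failed boot.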

Zygmunt Krynicki (zyga) wrote :

Steve, as described in the bug description, I'm using Network Manager with a DHCP setup. I have confirmed that each time the boot failed, networking was working reliably (no errors, IP and DNS okay, etc.).

Changed in mountall (Ubuntu):
status: Incomplete → New
Zygmunt Krynicki (zyga) wrote :

Looking at the --debug log from mountall, I can see that while some of my NFS shares are tagged as remote, /home and /home/zyga/source are _not_.
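
For comparison, a check based purely on the fstab filesystem type would expect every nfs entry to count as remote, including /home and /home/zyga/source. The following is a minimal sketch of that expectation, not mountall's actual logic; `REMOTE_TYPES` and `remote_entries` are names invented here, and the set of types is an assumption.

```python
REMOTE_TYPES = {"nfs", "nfs4", "cifs", "smbfs"}  # assumption for this sketch


def remote_entries(fstab_text):
    """Return the mount points whose fstab type is listed in REMOTE_TYPES."""
    result = []
    for line in fstab_text.splitlines():
        # Drop comments such as the trailing "# ENROLL" markers.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 3 and fields[2] in REMOTE_TYPES:
            result.append(fields[1])
    return result


fstab = """\
# /etc/fstab: static file system information.
proc /proc proc nodev,noexec,nosuid 0 0
silverbox:/mnt/vol4/home /home nfs auto 0 0 # ENROLL
silverbox:/mnt/vol4/source /home/zyga/source nfs auto 0 0 # ENROLL
silverbox:/mnt/vol4/steam /nas/steam nfs auto,bootwait 0 0 # ENROLL
"""
print(remote_entries(fstab))
```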

Steve Langasek (vorlon) wrote :

This may be the same as bug #643289. Awaiting test results using the quantal nfs-common; if that fixes the boot hangs, we can mark this as a duplicate and I can get around to doing that SRU.

Steve Langasek (vorlon) wrote :

Incidentally,

 silverbox:/mnt/vol4/home /home nfs auto,exec 0 0 # ENROLL

Why 'exec'? That's a default option; it's possible that mountall is confused and thinks that the mount needs remounted since the mount options don't match those specified.
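
To make that hypothesis concrete: a naive comparison between the options requested in fstab and the options the kernel reports would flag 'exec', because defaults are not listed in the kernel's option string. This is a sketch of the idea only, not code from mountall; the function name, the ignored-option set, and the sample reported options are all assumptions.

```python
def options_not_reported(requested, reported):
    """Return requested fstab options that are absent from the reported options."""
    # 'auto'/'noauto' only control boot-time mounting and 'bootwait' is a
    # mountall hint, so this sketch does not expect them in the kernel's list.
    ignored = {"auto", "noauto", "bootwait", "defaults"}
    reported_set = set(reported.split(","))
    return [opt for opt in requested.split(",")
            if opt not in ignored and opt not in reported_set]


# The reported option string here is illustrative, not taken from the logs.
print(options_not_reported("auto,exec", "rw,relatime,vers=3"))
```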

Zygmunt Krynicki (zyga) wrote : Re: [Bug 1233610] Re: boot process hangs very often when NFS shares are used

I added exec by mistake, I recall seeing noexec behavior (a while ago) so I
just slap this to be sure (my ~/.local/bin is full of stuff)

On Wed, Oct 2, 2013 at 9:58 PM, Steve Langasek <email address hidden> wrote:

> Incidentally,
>
> silverbox:/mnt/vol4/home /home nfs auto,exec 0 0 # ENROLL
>
> Why 'exec'? That's a default option; it's possible that mountall is
> confused and thinks that the mount needs remounted since the mount
> options don't match those specified.

Steve Langasek (vorlon) wrote :

On Wed, Oct 02, 2013 at 09:01:25PM -0000, Zygmunt Krynicki wrote:
> I added exec by mistake, I recall seeing noexec behavior (a while ago) so I
> just slap this to be sure (my ~/.local/bin is full of stuff)

Does removing this option change the behavior?

Zygmunt Krynicki (zyga) wrote :

Steve, removing exec improves nothing. I can reproduce the faulty behavior with _ONE_ nfs share (/home). I'm going to try upgrading mountall / nfs-common now.

Zygmunt Krynicki (zyga) wrote :

I've updated to mountall 2.42ubuntu0.4 and nfs-common 1:1.2.6-3ubuntu2 -- rebooting

Zygmunt Krynicki (zyga) wrote :

About a dozen reboots later I can say that it's not working.

I tried a few simplifications:

1) Try with my vanilla /etc/fstab -> hangs all the time (I never got to boot to desktop this way)
2) Drop everything but home (but keep 'bootwait') -> worked once (just typing this now), needs more testing to see if this is stable
3) Drop everything but home (also drop 'bootwait') -> this feels like a regression over the previous state (precise packages). I would hang about 70% of the time I've tried it. If it managed to boot to desktop /home would not be mounted! IIRC mountall in precise automatically adds 'bootwait' for /home, is this not the case in quantal?

I also got a case that feels like something else is affecting this. With just /home (without bootwait then) I got to the step where everything was mounted but apparently mountall didn't say it was "done". Looking at the log file I see that mountall could not talk to plymouth. I have a log file to show:

Zygmunt Krynicki (zyga) wrote :

Also, in the same log file above, it seems that mountall again fails to mark /home as remote filesystem. See how mountall says "remote 0/0". That feels like an evident bug.

Steve Langasek (vorlon) wrote :

> I've updated to mountall 2.42ubuntu0.4 and nfs-common 1:1.2.6-3ubuntu2 -- rebooting

I would not expect mountall 2.42ubuntu0.4 to make any difference. On IRC, what I suggested was that you try nfs-common from quantal (1:1.2.6-3ubuntu2), and if that didn't work, try adding mountall from saucy (2.51). There are a number of NFS-related fixes in the saucy mountall, and these symptoms could certainly be related to one of them - particularly the mount point not being correctly tagged "remote".

Could you test with mountall 2.51?

Zygmunt Krynicki (zyga) wrote :

Steve, mountall 2.42ubuntu0.4 is from quantal-updates as shown here http://packages.ubuntu.com/search?keywords=mountall&searchon=names&suite=all&section=all

I will try saucy packages next. Thanks.

Zygmunt Krynicki (zyga) wrote :

I've updated mountall to 2.51 which also pulled in libudev1 204 which also pulled in libc6 2.17-93 (both :i386 and :amd64)

Zygmunt Krynicki (zyga) wrote :

rebooting for testing...

Zygmunt Krynicki (zyga) wrote :

Second reboot (the first one hung): I got to the manual recovery prompt, poked around (just looked at the log file), then closed the manual recovery shell and got to the desktop. I'll do more testing but it still seems wrong (the first time I didn't run the manual recovery shell and it stayed stuck).

I have a feeling that mount fails to retry after initially failing before network manager really gets the interface ready for use. I got something similar (apparently network manager says networking works BEFORE it really works) while working on unrelated software in saucy last week. I added 'sleep 2' after network manager said "NM_STATE 70" (70 being globally routed connection available) and my issues went away.
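
A fixed 'sleep 2' is a blunt instrument; the same idea can be expressed as a bounded retry that probes the server before attempting the mount. The sketch below is an illustration under assumptions, not something taken from mountall or network manager: `nfs_port_open` and `wait_until` are hypothetical helpers, and the probe simply tries a TCP connection to port 2049, the standard NFS port.

```python
import socket
import time


def nfs_port_open(host, port=2049, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name resolution failures, refused connections and timeouts.
        return False


def wait_until(probe, attempts=5, delay=2.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For the setup in this report, the call would look like `wait_until(lambda: nfs_port_open("silverbox"))`, with the mount attempted only if it returns True.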

Zygmunt Krynicki (zyga) wrote :

zyga@fx:~$ sudo apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
You might want to run 'apt-get -f install' to correct these.
The following packages have unmet dependencies:
 libc-dev-bin : Depends: libc6 (< 2.16) but 2.17-93ubuntu1 is installed
 libc6-dbg : Depends: libc6 (= 2.15-0ubuntu10.4) but 2.17-93ubuntu1 is installed
 libc6-dev : Depends: libc6 (= 2.15-0ubuntu10.4) but 2.17-93ubuntu1 is installed
 libnih1 : PreDepends: libc6 (< 2.16) but 2.17-93ubuntu1 is installed
E: Unmet dependencies. Try using -f.

^^ - should I upgrade those to saucy as well?

Steve Langasek (vorlon) wrote :

On Mon, Oct 07, 2013 at 06:44:23PM -0000, Zygmunt Krynicki wrote:
> zyga@fx:~$ sudo apt-get dist-upgrade
> Reading package lists... Done
> Building dependency tree
> Reading state information... Done
> You might want to run 'apt-get -f install' to correct these.
> The following packages have unmet dependencies:
> libc-dev-bin : Depends: libc6 (< 2.16) but 2.17-93ubuntu1 is installed
> libc6-dbg : Depends: libc6 (= 2.15-0ubuntu10.4) but 2.17-93ubuntu1 is installed
> libc6-dev : Depends: libc6 (= 2.15-0ubuntu10.4) but 2.17-93ubuntu1 is installed
> libnih1 : PreDepends: libc6 (< 2.16) but 2.17-93ubuntu1 is installed
> E: Unmet dependencies. Try using -f.

> ^^ - should I upgrade those to saucy as well?

If you want your system to be in a consistent state...

Otherwise, you might want to just rebuild mountall 2.51 against precise, and
downgrade libc/libudev.

Steve Langasek (vorlon) wrote :

So here's what we know now:
 - the reason I've not been able to reproduce this problem is that my mounts are listed by name, and are *not* resolvable before the network is up; so the mount helper fails immediately with an unresolvable name (since mount correctly detects the network is down, and propagates this failure up the stack). In the submitter's case, the name *is* resolvable (via /etc/hosts) before the network is up, which means that the mount helper gets an IP to pass down to the kernel, which it does... and the kernel's behavior in response to a mount request for an unreachable IP is less than stellar (to wit: it does *not* immediately return an error).
 - even after sorting out this problem, the latest boot logs still show a problem getting all the way through the mounts. The mounts all succeed and are reported back, but the 'mounted' events don't finish for all of these, either because something in upstart is blocking them or because they get lost along the way in mountall.
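
The first point hinges on whether the server name can be resolved from /etc/hosts without any network. A quick way to check that is to look for the name in the hosts file itself. This is a minimal sketch; `hosts_file_names` is a hypothetical helper, and the sample address is invented for illustration and does not come from the report.

```python
def hosts_file_names(hosts_text):
    """Return the set of host names and aliases defined in hosts-file text."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        # Each entry is: address canonical-name [aliases...]
        names.update(fields[1:])
    return names


sample_hosts = """\
127.0.0.1 localhost
192.168.1.10 silverbox silverbox.lan  # example address, not from the report
"""
print("silverbox" in hosts_file_names(sample_hosts))
```

On a real system the text would come from reading /etc/hosts; a name found there resolves even while the interface is still down, which is the case described for the submitter's machine.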

My current theory is that the 'mounted' events are being lost because of a known issue with SIGUSR1 triggering duplicate 'mounting' events (bug #1048017)... perhaps if one of these 'mounting' events is being triggered late for the mount (which appears to be the case, from the logs), it is blocking the 'mounted' event from happening correctly. This is supported by the fact that the logs show an extra 'mounting' event for the mounts that aren't correctly recorded, vs. the ones that are correctly recorded.

Of further note, in the latest log we're getting 3/8 remote mounts recorded. But two of these are actually events for the *same* mountpoint. I've seen this issue previously but don't have a bug report open for it; it's certainly not the cause of boot hangs, but it could cause mountall to emit the 'filesystem' event too early.
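
The double counting is easy to see with a toy tally. The sketch below only illustrates the arithmetic: the event list is invented to mirror the "3 recorded, two for the same mountpoint" situation, and `summarize_events` is a hypothetical name, not a mountall function.

```python
from collections import Counter


def summarize_events(events):
    """Return (unique mount point count, {mount point: count} for duplicates)."""
    counts = Counter(events)
    duplicates = {point: n for point, n in counts.items() if n > 1}
    return len(counts), duplicates


# Illustrative events, not taken from the attached log.
recorded = ["/nas/steam", "/nas/steam", "/nas/music"]
unique, duplicates = summarize_events(recorded)
print(len(recorded), unique, duplicates)
```

Counting events rather than distinct mount points overstates progress by one here, which is how a counter could reach its target before every remote filesystem is really mounted.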

Steve Langasek (vorlon) wrote :

void
mounted (Mount *mnt)
{
[...]
        if (!mnt->pending_call)
                emit_event ("mounted", mnt, mounted_event_handled);
[...]
}

Yeah, that needs fixed. :/

Steve Langasek (vorlon) wrote :

Zygmunt, can you please try building a prerelease of mountall 2.52 from lp:ubuntu/mountall? I believe the latest commit addresses bug #1048017, which should also take care of some of the knock-on effects on your system. Not sure if it'll get us all the way there, but I think it should at least fix it so we don't miss the 'mounted' events.

Zygmunt Krynicki (zyga) wrote :

Sure, I'll give it a try and report back. Thanks!

On Tue, Oct 8, 2013 at 4:14 AM, Steve Langasek <email address hidden> wrote:

> Zygmunt, can you please try building a prerelease of mountall 2.52 from
> lp:ubuntu/mountall? I believe the latest commit addresses bug #1048017,
> which should also take care of some of the knock-on effects on your
> system. Not sure if it'll get us all the way there, but I think it
> should at least fix it so we don't miss the 'mounted' events.

Steve Langasek (vorlon) wrote :

Note that mountall 2.52 has now been uploaded to saucy (in order to fix other issues identified in 2.51... also related to network mounts). I'm pretty sure this will fix the remaining problems for you (it did for me in testing locally).

Changed in mountall (Ubuntu):
status: New → Incomplete
Zygmunt Krynicki (zyga) wrote :

I've built mountall 2.52 for precise and after one test I got to a fully working boot. I'll experiment some more but this looks like the right trail. Will mountall 2.52 be added to ubuntu-updates?

Changed in mountall (Ubuntu):
status: Incomplete → Triaged
Zygmunt Krynicki (zyga) wrote :

I'm setting this to triaged. If something new shows up I'll post updates.

Steve Langasek (vorlon) wrote :

I'm not sure this is backportable; the changes here are all intertwined and it's hard to be sure there are no regressions. But we'll leave a precise task open, for now.

Changed in mountall (Ubuntu):
status: Triaged → Fix Released
Steve Langasek (vorlon)
Changed in mountall (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Medium
Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in mountall (Ubuntu Precise):
status: Triaged → Won't Fix