Long delay when mounting NFS shares

Bug #1327563 reported by My Karlsson
This bug affects 5 people
Affects          Status         Importance   Assigned to    Milestone
linux (Ubuntu)   Fix Released   Medium       Unassigned
Trusty           Fix Released   Medium       Stefan Bader
Utopic           Fix Released   Medium       Unassigned

Bug Description

SRU justification:

Impact: Kernel 3.13 shows a 16s delay when mounting NFS shares without using gssd (Kerberos authentication) because it does not detect whether gssd is running.

Fix: This was fixed upstream in 3.14 by having gssd create a dummy pipe which is used as an indication of it running. Cherry-picking 3 patches from upstream gets rid of the delay.

Testcase: Mounting an NFS share with unix authentication is delayed by 16s unless the NFSv3 protocol is forced (using -o nfsvers=3 with the mount command). After applying the patches the delay is gone.
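
A minimal sketch of how the testcase delay could be measured. The helper name time_cmd is hypothetical and not part of the report; it only wraps an arbitrary command and prints how many whole seconds it took.

```shell
# time_cmd: run the given command and print its wall-clock duration
# in whole seconds on stdout. The command's own output goes to stderr
# so that only the duration is captured by $(...).
# (Hypothetical helper, not taken from the bug report.)
time_cmd() {
    start=$(date +%s)
    "$@" >&2
    rc=$?
    end=$(date +%s)
    echo $((end - start))
    return "$rc"
}
```

Run as root, for example "time_cmd mount -t nfs server:/export /mnt" and then, after unmounting, "time_cmd mount -t nfs -o nfsvers=3 server:/export /mnt", where server:/export and /mnt are placeholders. Going by the description above, an affected client should print roughly 16 for the first call and close to 0 for the second.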

---

Summary:

After upgrading the kernel on Ubuntu 14.04 to 3.13.0-29-generic I found that it can take a considerably longer amount of time to mount NFS shares.

How to reproduce:

1. Install Ubuntu 14.04 on two machines. A minimal server install should be fine. One will act as a server, the other as a client.
2. Upgrade the kernel on the client machine to at least 3.13.0-27-generic.
3. Install the nfs-kernel-server package on the server.
4. Add a directory to /etc/exports and run service nfs-kernel-server start.
5. Install the nfs-common package on the client.
6. Attempt to mount the shared directory on the client.
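
The steps above could be scripted roughly as follows. This is only a sketch: the export directory, server name, mount point and export options are assumptions chosen for illustration, and the two functions are meant to be run as root on the server and the client respectively.

```shell
#!/bin/sh
# Hypothetical values for illustration; none of them appear in the report.
EXPORT_DIR=/srv/nfs/share
SERVER=nfs-server.example
MOUNT_POINT=/mnt/nfs-test

# Steps 3-4: run as root on the server.
setup_server() {
    apt-get install -y nfs-kernel-server
    mkdir -p "$EXPORT_DIR"
    # Exports to every host; acceptable only on an isolated test network.
    printf '%s *(rw,sync,no_subtree_check)\n' "$EXPORT_DIR" >> /etc/exports
    service nfs-kernel-server start
}

# Steps 5-6: run as root on the client and report how long the mount took.
reproduce_on_client() {
    apt-get install -y nfs-common
    mkdir -p "$MOUNT_POINT"
    start=$(date +%s)
    mount -t nfs "$SERVER:$EXPORT_DIR" "$MOUNT_POINT"
    echo "mount took $(( $(date +%s) - start )) s"
}
```

Neither function is called automatically; invoke setup_server on the server first, then reproduce_on_client on the client.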

Expected result:

The shared directory should be mounted on the client more or less immediately.

Actual result:

The mount command hangs and eventually completes. In this test case it takes 16 seconds consistently. On a production machine where the problem was first discovered the time is considerably longer and essentially makes it impossible to mount NFS shares.

Regression:

I have been able to reproduce the problem on clients running Linux 3.13.0-27-generic and later. The problem is not reproducible on clients running Linux 3.13.0-24-generic. The problem is reproducible no matter which kernel version is used on the server.

Tested client kernel versions:

Linux 3.13.0-29-generic #53-Ubuntu FAIL
Linux 3.13.0-27-generic #50-Ubuntu FAIL
Linux 3.13.0-24-generic #47-Ubuntu OK

Tested server kernel versions:

Linux 3.13.0-29-generic #53-Ubuntu OK
Linux 3.13.0-27-generic #50-Ubuntu OK
Linux 3.13.0-24-generic #47-Ubuntu OK

lsb_release -rd:

Description: Ubuntu 14.04 LTS
Release: 14.04

apt-cache policy linux-image-generic:

linux-image-generic:
  Installed: (none)
  Candidate: 3.13.0.29.35
  Version table:
     3.13.0.29.35 0
        500 http://se.archive.ubuntu.com/ubuntu/ trusty-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ trusty-security/main amd64 Packages
     3.13.0.24.28 0
        500 http://se.archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-29-generic 3.13.0-29.53
ProcVersionSignature: Ubuntu 3.13.0-29.53-generic 3.13.11.2
Uname: Linux 3.13.0-29-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 7 16:26 seq
 crw-rw---- 1 root audio 116, 33 Jun 7 16:26 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: [Errno 2] No such file or directory: 'fuser'
CRDA: Error: [Errno 2] No such file or directory: 'iw'
Date: Sat Jun 7 16:30:02 2014
HibernationDevice: RESUME=UUID=9c13108b-d40a-4881-b563-c477ed2e6804
InstallationDate: Installed on 2014-06-07 (0 days ago)
InstallationMedia: Ubuntu-Server 14.04 LTS "Trusty Tahr" - Release amd64 (20140416.2)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
MachineType: QEMU Standard PC (i440FX + PIIX, 1996)
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-29-generic root=UUID=9068010d-8332-47c8-ae62-ccc62ca57290 ro
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-29-generic N/A
 linux-backports-modules-3.13.0-29-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/01/2011
dmi.bios.vendor: Bochs
dmi.bios.version: Bochs
dmi.chassis.type: 1
dmi.chassis.vendor: Bochs
dmi.modalias: dmi:bvnBochs:bvrBochs:bd01/01/2011:svnQEMU:pnStandardPC(i440FX+PIIX,1996):pvrpc-i440fx-2.0:cvnBochs:ct1:cvr:
dmi.product.name: Standard PC (i440FX + PIIX, 1996)
dmi.product.version: pc-i440fx-2.0
dmi.sys.vendor: QEMU

My Karlsson (mykarlsson-deactivatedaccount) wrote :
description: updated
summary: - Unable to mount NFS shares
+ Long timeout when mounting NFS shares
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
description: updated
description: updated
summary: - Long timeout when mounting NFS shares
+ Long delay when mounting NFS shares
tags: added: kernel nfs regression-update
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a bisect to figure out what commit caused this regression. We need to identify the earliest kernel where the issue started happening as well as the latest kernel that did not have this issue.

Can you test the following kernels and post back? We are looking for the first kernel version that exhibits this bug:

v3.13.10: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11-trusty/
v3.13.11.1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11.1-trusty/
v3.13.11.2: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11.2-trusty/

You don't have to test every kernel, just up until the kernel that first has this bug.

Thanks in advance!

tags: added: performing-bisect
Changed in linux (Ubuntu):
importance: Undecided → Medium
My Karlsson (mykarlsson-deactivatedaccount) wrote :

I have installed and tested all three kernels (v3.13.10, v3.13.11.1 and v3.13.11.2) by rebooting and selecting the kernel version at the grub menu. I did however get the same 16 second delay with all of them. I went even further back and tested v3.13.10-trusty, v3.13.9-trusty, v3.13.8-trusty, v3.13.7-trusty, v3.13.6-trusty and v3.13.5-trusty but they all showed the same behavior. It still works fine when going back to 3.13.0-24-generic though.

I did however find that the problem appears to be limited to NFSv4. There is no noticeable delay when using the vers=3 mount option.

Joseph Salisbury (jsalisbury) wrote :

That is strange that upstream 3.13.9 also has the bug, since that is what 3.13.0-24-generic is based on. I don't see any specific SAUCE patches that would fix this. I do see a number of nfs changes that went into upstream 3.13.11.1 and 3.13.11.2:

  * nfsd4: buffer-length check for SUPPATTR_EXCLCREAT
  * nfsd4: session needs room for following op to error out
  * nfsd4: leave reply buffer space for failed setattr
  * nfsd4: fix test_stateid error reply encoding
  * nfsd: notify_change needs elevated write count
  * nfsd4: fix setclientid encode size
  * nfsd: check passed socket's net matches NFSd superblock's one

Can you confirm that v3.11.9 does in fact exhibit the bug:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.9-trusty/

If it does, then we may need to perform a kernel bisect between the 3.13.0-24-generic and 3.13.0-27-generic Ubuntu kernels.

Joseph Salisbury (jsalisbury) wrote :

Sorry meant to say:
Can you confirm that v3.13.9 does in fact exhibit the bug, not 3.11.9.

My Karlsson (mykarlsson-deactivatedaccount) wrote :

I can confirm that v3.13.9 does in fact exhibit the bug. I installed linux-image-3.13.9-031309-generic_3.13.9-031309.201404031554_amd64.deb and rebooted into the new kernel. Running uname -r shows "3.13.9-031309-generic". Mounting an nfs share, even when shared from localhost, results in a long delay of 16 seconds or more.

Joseph Salisbury (jsalisbury) wrote :

Can you test one additional upstream kernel before I bisect between v3.13.0-24-generic and v3.13.0-26-generic? Can you test the latest mainline kernel, which can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc1-utopic/

My Karlsson (mykarlsson-deactivatedaccount) wrote :

I have now tested v3.16-rc1-utopic and v3.16-rc2-utopic and they both appear to work fine on 14.04 with NFSv4. I was unable to reproduce the bug.

Joseph Salisbury (jsalisbury) wrote :

It sounds like this bug is fixed upstream in mainline. Can you now test the latest 3.13 upstream stable kernel:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.13.11.4-trusty/

If that still has the bug, then we can perform a "Reverse" bisect to find what commit fixes this in mainline.

My Karlsson (mykarlsson-deactivatedaccount) wrote :

I installed linux-image-3.13.11-03131104-generic_3.13.11-03131104.201406201536_amd64.deb from v3.13.11.4-trusty but unfortunately I was still able to reproduce the bug. I tested it on two machines just to be sure, but there is still a 16 second delay when mounting from one machine to the other as well as locally.

Stefan Bader (smb) wrote :

To add a bit of debugging I did (using two VMs):
- The server-side kernel version does not matter (tested trusty and utopic)
- A client with a 3.13 kernel will have the delay, while the 3.16 kernel is ok

Enabling NFS debugging on both clients (as root "echo 32767 >/proc/sys/sunrpc/nfs_debug") shows an interesting fact:

3.13 client:
[ 7581.776148] NFS: nfs4_discover_server_trunking: testing 'lam-utopic6401'
[ 7581.776154] NFS call setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 192.168.2.19
[ 7596.776086] RPC: AUTH_GSS upcall timed out.
[ 7596.776086] Please check user daemon is running.
[ 7596.776110] NFS reply setclientid: -13
[ 7596.776121] NFS call setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 192.168.2.19
[ 7597.024109] NFS reply setclientid: -13
[ 7597.024766] NFS call setclientid auth=UNIX, 'Linux NFSv4.0 192.168.2.192/192.
[ 7597.025506] NFS reply setclientid: 0
[ 7597.025512] NFS call setclientid_confirm auth=UNIX, (client ID fb69c653040000
[ 7597.026068] NFS reply setclientid_confirm: 0

3.16 client:
[ 2137.866775] NFS: nfs4_discover_server_trunking: testing 'lam-utopic6401'
[ 2137.866783] NFS call setclientid auth=UNIX, 'Linux NFSv4.0 192.168.2.120/192.
[ 2137.867420] NFS reply setclientid: 0
[ 2137.867426] NFS call setclientid_confirm auth=UNIX, (client ID fb69c653050000
[ 2137.867727] NFS reply setclientid_confirm: 0

So the newer kernel seems to skip the other authentication methods. And of those two AUTH_GSS is causing the delay because it times out. Unfortunately I did not spot any commits in between 3.13 and now that would immediately sound like a good candidate for fixing this.
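
To put a number on the gap seen in such a trace, something like the following could be used. It is only a sketch: elapsed_in_log is a hypothetical helper name, and it assumes each relevant line carries a kernel timestamp of the form "seconds.microseconds]" as in the excerpts above.

```shell
# elapsed_in_log: read kernel log lines on stdin and print the time in
# seconds between the first and the last timestamped line.
# (Hypothetical helper, not taken from the bug report.)
elapsed_in_log() {
    awk '
        # A kernel timestamp is a decimal number directly followed by "]".
        match($0, /[0-9]+\.[0-9]+\]/) {
            t = substr($0, RSTART, RLENGTH - 1) + 0
            if (!seen) { first = t; seen = 1 }
            last = t
        }
        END { if (seen) printf "%.2f\n", last - first }
    '
}
```

For example, "dmesg | grep 'NFS' | elapsed_in_log" right after a mount attempt. Fed the first and last lines of the 3.13 trace above, it prints 15.25, which is almost entirely the AUTH_GSS upcall timeout.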

Pete Hildebrandt (send2ph) wrote :

Seems to be fixed now. Everyone else agree?

My Karlsson (mykarlsson-deactivatedaccount) wrote :

Unfortunately no, I can still reproduce it on 3.13.0-32-generic.

Bruno (bruno666-666) wrote :
Stefan Bader (smb) wrote :

This looks to be a kernel side issue which was fixed in 3.14 by:

* nfs: check if gssd is running before attempting to use krb5i auth in SETCLIENTID call
* sunrpc: replace sunrpc_net->gssd_running flag with a more reliable check
* sunrpc: create a new dummy pipe for gssd to hold open

Cherry-picking those three patches on top of 3.13 gets rid of the delay.

Changed in linux (Ubuntu Trusty):
assignee: nobody → Stefan Bader (smb)
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Stefan Bader (smb)
description: updated
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Trusty):
status: Triaged → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
Bruno (bruno666-666) wrote :

I confirm that this bug is fixed in the new kernel release 3.13.0-35.62 (amd64).
Thanks for your work, this was a very annoying bug for me.
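
For anyone checking whether a Trusty client already runs a kernel with the fix, a rough sketch follows. The function name has_gssd_fix is hypothetical, the threshold 35 comes from the 3.13.0-35.62 release mentioned in this report, and the check only understands release strings of the form 3.13.0-N-flavour.

```shell
# has_gssd_fix: succeed if the given kernel release string (as printed
# by "uname -r") is a Trusty 3.13.0-N kernel with N >= 35.
# Returns 1 for older 3.13.0 kernels and 2 for anything it cannot judge.
# (Hypothetical helper, not taken from the bug report.)
has_gssd_fix() {
    case "$1" in
        3.13.0-*) ;;
        *) return 2 ;;
    esac
    rest=${1#3.13.0-}      # e.g. "35-generic"
    abi=${rest%%-*}        # e.g. "35"
    [ "$abi" -ge 35 ] 2>/dev/null
}
```

Typical use would be: has_gssd_fix "$(uname -r)" && echo "fix should be present".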

Bruno (bruno666-666)
tags: added: verification-done-trusty
removed: verification-needed-trusty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-35.62

---------------
linux (3.13.0-35.62) trusty; urgency=low

  [ Joseph Salisbury ]

  * Release Tracking Bug
    - LP: #1357148

  [ Brad Figg ]

  * Start new release

  [ dann frazier ]

  * SAUCE: (no-up) Fix build failure on arm64
    - LP: #1353657
  * [debian] Allow for package revisions condusive for branching

  [ David Henningsson ]

  * SAUCE: Call broadwell specific functions from the hda driver
    - LP: #1317865

  [ Edward Lin ]

  * SAUCE: (no-up) Add use native backlight quirk for Dell Inspiron
    5547/5447
    - LP: #1332437

  [ Imre Deak ]

  * SAUCE: drm/i915: move power domain init earlier during system resume
    - LP: #1353405

  [ Jani Nikula ]

  * SAUCE: drm/i915: use lane count and link rate from VBT as minimums for
    eDP
    - LP: #1338582
  * SAUCE: drm/i915/dp: force eDP lane count to max available lanes on BDW
    - LP: #1338582
  * SAUCE: drm/i915: provide interface for audio driver to query cdclk
    - LP: #1188091
  * SAUCE: drm/i915: demote opregion excessive timeout WARN_ONCE to
    DRM_INFO_ONCE
    - LP: #1351014

  [ Joseph Salisbury ]

  * [Config] updateconfigs after Linux 3.13.11.6 updates

  [ Luis Henriques ]

  * Revert "[Packaging] linux-udeb-flavour -- standardise on linux prefix"

  [ Ming Lei ]

  * Revert "SAUCE: (no-up) ata: Fix the dma state machine lockup for the
    IDENTIFY DEVICE PIO mode command."
    - LP: #1335645

  [ Paulo Zanoni ]

  * SAUCE: drm/i915: consider the source max DP lane count too
    - LP: #1338582

  [ Tim Gardner ]

  * [Config] CONFIG_GPIO_SYSFS=y
    - LP: #1342153
  * [Config] CONFIG_KEYS_DEBUG_PROC_KEYS=y
    - LP: #1344405
  * [Config] updateconfigs
  * [Config] CONFIG_SCSI_IPR_TRACE=y, CONFIG_SCSI_IPR_DUMP=y
    - LP: #1343109
  * [Config] CONFIG_CONTEXT_TRACKING_FORCE=n
    - LP: #1349028

  [ Timo Aaltonen ]

  * SAUCE: Fix a typo in hda i915_bdw support.
    - LP: #1343140

  [ Upstream Kernel Changes ]

  * Revert "net/mlx4_en: Fix bad use of dev_id"
    - LP: #1347012
  * Revert "ACPI / AC: Remove AC's proc directory."
    - LP: #1356913
  * Revert "mac80211: move "bufferable MMPDU" check to fix AP mode scan"
    - LP: #1356913
  * mm, pcp: allow restoring percpu_pagelist_fraction default
    - LP: #1347088
  * net: Fix permission check in netlink_connect()
    - LP: #1312989
  * netlink: Rename netlink_capable netlink_allowed
    - LP: #1312989
  * net: Move the permission check in sock_diag_put_filterinfo to
    packet_diag_dump
    - LP: #1312989
  * net: Add variants of capable for use on on sockets
    - LP: #1312989
  * net: Add variants of capable for use on netlink messages
    - LP: #1312989
  * net: Use netlink_ns_capable to verify the permisions of netlink
    messages
    - LP: #1312989
  * netlink: Only check file credentials for implicit destinations
    - LP: #1312989
  * igb: fix stats for i210 rx_fifo_errors
    - LP: #1338893
  * HID: use multi input quirk for 22b9:2968
    - LP: #1339567
  * crypto/nx: disable NX on little endian builds
    - LP: #1338666
  * ACPI / video: Add Dell Inspiron 5737 to the blacklist
    - LP: #1250401
  * Input: elantech - deal with clickpads reportin...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released