ubuntu guest with 10G n/w and Texan iSCSI crashes during FIO

Bug #1517142 reported by bugproxy
This bug affects 1 person
Affects         Status        Importance  Assigned to       Milestone
linux (Ubuntu)  Fix Released  Medium      Taco Screen team
Vivid           Fix Released  Undecided   Unassigned
Wily            Fix Released  Undecided   Unassigned
Xenial          Fix Released  Medium      Taco Screen team

Bug Description

Issues were found in iSCSI tests with hardware remote targets. Specifically, the kernel crashes due to a NULL pointer dereference (sc->device->lun at libiscsi.c:369, with sc == NULL). During the crash, many messages about invalid list accesses are shown in the kernel log.

The commit 659743b02c41 ("[SCSI] libiscsi: Reduce locking contention in fast path") appears to be the cause.

Reverting the commit solves the issue, at least until we can discuss and
find the exact problem (and its solution) in the commit
659743b02c41 ("[SCSI] libiscsi: Reduce locking contention in fast path").

A test kernel was patched to revert the offending commit; Prashantha is running tests to check whether the problem is solved.

With the patched kernel, I am unable to recreate the crash. The patch appears
to be working.

A discussion is ongoing on the linux-scsi mailing list about reverting the patch upstream (see the following link).

http://marc.info/?l=linux-scsi&m=144730474819919

Another quick discussion, started by me, is on the open-iscsi mailing list, on Google Groups:

https://groups.google.com/forum/#!topic/open-iscsi/0S5fEM_Aafk

The iSCSI maintainer wants to revert, but the patch co-author wants more study before reverting. Prashantha is performing some performance analysis to check the impact of the patch on iSCSI performance.

Mirroring to Launchpad for Canonical's awareness. Once the discussion settles on the final solution, a patch or link to the upstream commit will be provided for Canonical to review for acceptance in the 14.04 LTS kernel and SRU.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-133001 severity-critical targetmilestone-inin14043
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
penalvch (penalvch)
tags: added: bisect-done
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
bugproxy (bugproxy)
tags: added: severity-high
removed: severity-critical
Michael Hohnbaum (hohnbaum) wrote :

IBM, any updates on the recommended fix for this issue?

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-01-13 06:36 EDT-------
(In reply to comment #12)
> IBM, any updates on the recommended fix for this issue?

Hello, thanks for bumping this.
The fix right now is to revert the commit mentioned above: 659743b02c41 ("[SCSI] libiscsi: Reduce locking contention in fast path").

There is an upstream discussion on this; the iSCSI maintainer proposed a patch to revert it. The discussion is at http://marc.info/?l=linux-scsi&m=144730474819919 .

The commit's author doesn't want to revert the patch, and the discussion is stalled. We already reverted it in our internal distro, so that is our recommendation.

Cheers,

Guilherme

Tim Gardner (timg-tpi) wrote :

What kernel versions are affected? Commit 659743b02c411075b26601725947b21df0bb29c8 has been in mainline since v3.15.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-01-14 14:34 EDT-------
(In reply to comment #14)
> What kernel versions are affected ? Commit
> 659743b02c411075b26601725947b21df0bb29c8. has been mainline since v3.15

Every kernel containing this commit should be affected, unfortunately.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-01-21 10:10 EDT-------
The commit's author is not too "friendly" to the idea of reverting it. He keeps asking for more evidence that it crashes the system. We can easily reproduce it on our side, and by reverting the commit in the Ubuntu source, we can build a kernel that no longer crashes. The commit's author didn't provide any fix for this, so things are where they were before.

So, what do you think about reverting this in Ubuntu even though it's not reverted in mainline yet? The iSCSI maintainer is also friendly to the idea of reverting.

In IBM's internal distro we already reverted. My recommendation is to revert in Ubuntu, if you agree.

Thanks,

Guilherme

Brad Figg (brad-figg)
Changed in linux (Ubuntu Vivid):
status: New → Fix Committed
Tim Gardner (timg-tpi) wrote :

That commit has now been reverted in Vivid, Wily, and Xenial.

Changed in linux (Ubuntu Wily):
status: New → Fix Committed
Changed in linux (Ubuntu Xenial):
status: Triaged → Fix Committed
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid verification-needed-wily
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-wily' to 'verification-done-wily'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-01-22 09:29 EDT-------
Thanks very much. Do you know exactly which versions of the 3.19 (Vivid) and 4.2 (Wily) kernels will contain the revert, so we can test them?

Cheers,

Guilherme

Tim Gardner (timg-tpi) wrote :

Vivid 3.19.0-49.55
Wily 4.2.0-27.32
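
For anyone checking whether an installed kernel package already carries the revert, the comparison against the versions quoted in this report can be sketched as below. This is a minimal sketch, not an official tool: the names `FIRST_FIXED`, `parse_version` and `has_revert` are hypothetical, the Xenial entry is taken from the Launchpad Janitor message later in this report, and the comparison is a simplified numeric one that assumes versions of the form `A.B.C-ABI.UPLOAD` rather than the full Debian version ordering.

```python
import re

# First package versions containing the revert, as quoted in this bug report.
# (Xenial comes from the Launchpad Janitor "fixed in" message.)
FIRST_FIXED = {
    "vivid": "3.19.0-49.55",
    "wily": "4.2.0-27.32",
    "xenial": "4.4.0-2.16",
}

# Matches the leading "A.B.C-ABI.UPLOAD" part of an Ubuntu kernel package version.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)")


def parse_version(version):
    """Turn '3.19.0-49.55' into the tuple (3, 19, 0, 49, 55)."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError("unrecognised kernel package version: %r" % version)
    return tuple(int(part) for part in match.groups())


def has_revert(series, version):
    """Return True if `version` is at or after the first fixed version for `series`."""
    return parse_version(version) >= parse_version(FIRST_FIXED[series])


if __name__ == "__main__":
    for series, version in [("vivid", "3.19.0-48.54"), ("vivid", "3.19.0-49.55")]:
        print(series, version, has_revert(series, version))
```

The package version to pass in is the one reported by the package manager for the installed linux image, not the shorter string printed by `uname -r`, which omits the upload number.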

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-01-28 09:08 EDT-------
I have tested the proposed kernel and found no issues. The crash was not recreated.

Brad Figg (brad-figg)
tags: added: verification-done-vivid verification-done-wily
removed: verification-needed-vivid verification-needed-wily
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-2.16

---------------
linux (4.4.0-2.16) xenial; urgency=low

  [ Andy Whitcroft ]

  * Release Tracking Bug
    - LP: #1539090
  * SAUCE: hv: hv_set_ifconfig -- convert to python3
    - LP: #1506521
  * SAUCE: dm: introduce a target_ioctl op to allow target specific ioctls
    - LP: #1538618

  [ Colin Ian King ]

  * SAUCE: ACPI / tables: Add acpi_force_32bit_fadt_addr option to force 32
    bit FADT addresses (LP: #1529381)
    - LP: #1529381

  [ John Johansen ]

  * SAUCE: (no-up): apparmor: fix for failed mediation of socket that is
    being shutdown
    - LP: #1446906

  [ Mahesh Salgaonkar ]

  * SAUCE: Powernv: Remove the usage of PACAR1 from opal wrappers
    - LP: #1537881
  * SAUCE: powerpc/book3s: Fix TB corruption in guest exit path on HMI
    interrupt.
    - LP: #1537881
  * SAUCE: KVM: PPC: Book3S HV: Fix soft lockups in KVM on HMI for time
    base errors
    - LP: #1537881

  [ Paolo Pisati ]

  * SAUCE: arm64: errata: Add -mpc-relative-literal-loads to erratum
    #843419 build flags
    - LP: #1533009
  * [Config] MFD_TPS65217=y && REGULATOR_TPS65217=y
  * [Config] disable ARCH_ZX (ZTE ZX Soc)

  [ Tim Gardner ]

  * Revert "SAUCE: (noup) cxlflash: a couple off by one bugs"
  * SAUCE: (no-up) Update bnx2x firmware to 7.12.30.0
    - LP: #1536719
  * SAUCE: drop obsolete bnx2x firmware
  * SAUCE: i40e: Silence 'may be used uninitialized' warnings
    - LP: #1536474
  * [Config] CONFIG_ZONE_DMA=y for amd64 lowlatency
    - LP: #1534647
  * [Config] Add pvpanic to virtual flavour
    - LP: #1537923
  * [Config] CONFIG_INTEL_PUNIT_IPC=m, CONFIG_INTEL_TELEMETRY=m
    - LP: #1520457

  [ Upstream Kernel Changes ]

  * i40evf: fix compiler warning of unused variable
    - LP: #1536474
  * intel: i40e: fix confused code
    - LP: #1536474
  * i40e/i40evf: remove unused tunnel parameter
    - LP: #1536474
  * i40e: Change BUG_ON to WARN_ON in service event complete
    - LP: #1536474
  * i40e: remove BUG_ON from feature string building
    - LP: #1536474
  * i40e: remove BUG_ON from FCoE setup
    - LP: #1536474
  * i40e: Workaround fix for mss < 256 issue
    - LP: #1536474
  * i40e/i40evf: Add a stat to track how many times we have to do a force
    WB
    - LP: #1536474
  * i40e: Move the saving of old link info from handle_link_event to
    link_event
    - LP: #1536474
  * i40e/i40evf: Add comment to #endif
    - LP: #1536474
  * i40e/i40evf: clean up error messages
    - LP: #1536474
  * i40evf: handle many MAC filters correctly
    - LP: #1536474
  * i40e: return the number of enabled queues for ETHTOOL_GRXRINGS
    - LP: #1536474
  * i40e: rework the functions to configure RSS with similar parameters
    - LP: #1536474
  * i40e: create a generic configure rss function
    - LP: #1536474
  * i40e: Bump version to 1.4.2
    - LP: #1536474
  * i40e: add new fields to store user configuration
    - LP: #1536474
  * i40e: rename rss_size to alloc_rss_size in i40e_pf
    - LP: #1536474
  * i40e/i40evf: Fix RS bit update in Tx path and disable force WB
    workaround
    - LP: #1536474
  * i40e/i40evf: prefetch skb data on transmit
    - LP: #1536474
  * i40evf: rename VF adapter s...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-49.55

---------------
linux (3.19.0-49.55) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1536775

  [ Colin Ian King ]

  * SAUCE: (no-up) ACPI / tables: Add acpi_force_32bit_fadt_addr option to
    force 32 bit FADT addresses
    - LP: #1529381

  [ Tim Gardner ]

  * [Config] Add DRM ast driver to udeb installer image
    - LP: #1514711
  * SAUCE: (no-up) Revert "[SCSI] libiscsi: Reduce locking contention in
    fast path"
    - LP: #1517142

  [ Upstream Kernel Changes ]

  * powerpc/eeh: Fix recursive fenced PHB on Broadcom shiner adapter
    - LP: #1532942
  * Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors
    - LP: #1440103
  * Drivers: hv: vmbus: teardown hv_vmbus_con workqueue and
    vmbus_connection pages on shutdown
    - LP: #1440103
  * drivers: hv: vmbus: Teardown synthetic interrupt controllers on module
    unload
    - LP: #1440103
  * clockevents: export clockevents_unbind_device instead of
    clockevents_unbind
    - LP: #1440103
  * Drivers: hv: vmbus: Teardown clockevent devices on module unload
    - LP: #1440103
  * Drivers: hv: vmbus: Add support for VMBus panic notifier handler
    - LP: #1440103
  * hv: run non-blocking message handlers in the dispatch tasklet
    - LP: #1440103
  * Drivers: hv: vmbus: unregister panic notifier on module unload
    - LP: #1440103
  * Drivers: hv: vmbus: Implement the protocol for tearing down vmbus state
    - LP: #1440103
  * kexec: define kexec_in_progress in !CONFIG_KEXEC case
    - LP: #1440103
  * Drivers: hv: vmbus: add special kexec handler
    - LP: #1440103
  * Drivers: hv: don't do hypercalls when hypercall_page is NULL
    - LP: #1440103
  * Drivers: hv: vmbus: add special crash handler
    - LP: #1440103
  * Drivers: hv: vmbus: prefer 'die' notification chain to 'panic'
    - LP: #1440103
  * hyperv: Implement netvsc_get_channels() ethool op
    - LP: #1494423
  * hv_netvsc: Properly size the vrss queues
    - LP: #1494423
  * hv_netvsc: Allocate the sendbuf in a NUMA aware way
    - LP: #1494423
  * hv_netvsc: Allocate the receive buffer from the correct NUMA node
    - LP: #1494423
  * Drivers: hv: vmbus: Implement NUMA aware CPU affinity for channels
    - LP: #1494423
  * Drivers: hv: vmbus: Allocate ring buffer memory in NUMA aware fashion
    - LP: #1494423
  * Drivers: hv: vmbus: Improve the CPU affiliation for channels
    - LP: #1494423
  * Drivers: hv: vmbus: Further improve CPU affiliation logic
    - LP: #1494423

linux (3.19.0-48.54) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1536124
  * Merged back Ubuntu-3.19.0-46.52

 -- Brad Figg <email address hidden> Thu, 21 Jan 2016 12:29:48 -0800

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-27.32

---------------
linux (4.2.0-27.32) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1536867

  [ Andy Whitcroft ]

  * SAUCE: (no-up) add compat_uts_machine= kernel command line override
    - LP: #1520627

  [ Colin Ian King ]

  * SAUCE: (no-up) ACPI / tables: Add acpi_force_32bit_fadt_addr option to
    force 32 bit FADT addresses
    - LP: #1529381

  [ Eric Dumazet ]

  * SAUCE: (no-up) udp: properly support MSG_PEEK with truncated buffers
    - LP: #1527902

  [ Guilherme G. Piccoli ]

  * SAUCE: powerpc/eeh: Validate arch in eeh_add_device_early()
    - LP: #1486180

  [ Tim Gardner ]

  * SAUCE: (no-up) Revert "[SCSI] libiscsi: Reduce locking contention in
    fast path"
    - LP: #1517142
  * [Config] Add DRM ast driver to udeb installer image
    - LP: #1514711

  [ Upstream Kernel Changes ]

  * net/mlx5e: Re-eanble client vlan TX acceleration
    - LP: #1533249
  * net/mlx5e: Fix LSO vlan insertion
    - LP: #1533249
  * net/mlx5e: Fix inline header size calculation
    - LP: #1533249
  * net: usb: cdc_ncm: Adding Dell DW5812 LTE Verizon Mobile Broadband Card
    - LP: #1533118
  * net: usb: cdc_ncm: Adding Dell DW5813 LTE AT&T Mobile Broadband Card
    - LP: #1533118
  * powerpc/eeh: Fix recursive fenced PHB on Broadcom shiner adapter
    - LP: #1532942

linux (4.2.0-26.31) wily; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1535795
  * Merged back Ubuntu-4.2.0-25.30

 -- Brad Figg <email address hidden> Thu, 21 Jan 2016 18:44:37 -0800

Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-08 12:08 EDT-------
Thanks for adding the patch, will close the bug now.

Cheers,

Guilherme
