Ubuntu 18.04 - IO Hang on some namespaces when running HTX with 16 namespaces (Bolt / NVMe)

Bug #1757497 reported by bugproxy
This bug affects 1 person
Affects                            Status        Importance  Assigned to            Milestone
The Ubuntu-power-systems project   Fix Released  High        Canonical Kernel Team
linux (Ubuntu)                     Fix Released  High        Joseph Salisbury
Bionic                             Fix Released  High        Joseph Salisbury

Bug Description

---Problem Description---
We are seeing an I/O hang on some namespaces when running HTX against 16 namespaces on Ubuntu 18.04.

---uname output---
Linux ltciofvtr-spoon4 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:21:52 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
(Bolt / NVMe)0003:01:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller 172Xa [144d:a822] (rev 01)

Machine Type = AC922

---Steps to Reproduce---
1> Install Ubuntu 18.04 and upgrade to the 4.15.0-10 kernel.
2> Install htxubuntu-472.deb.
3> Make sure you create the namespaces:
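#!/bin/bash

device=/dev/nvme0
echo "$device"

# WARNING: formatting destroys all data on the device.
nvme format "$device"

# Feature 0x0b is Asynchronous Event Configuration; 0x0100 enables namespace attribute notices.
nvme set-feature "$device" -f 0x0b --value=0x0100

# Delete every existing namespace (0xFFFFFFFF is the broadcast namespace ID).
nvme delete-ns "$device" -n 0xFFFFFFFF
sleep 5
nvme list

# Log page 4 is the Changed Namespace List.
nvme get-log "$device" -l 200 -i 4

# "nn" is the number of namespaces the controller supports.
max=$(nvme id-ctrl "$device" | grep ^nn | awk '{print $NF}')

# Create and attach one namespace per supported namespace ID.
for i in $(seq 1 "$max")
do
    echo "$i"
    nvme create-ns "$device" --nsze=7000000 --ncap=7000000 --flbas=0 --dps=0
    nvme attach-ns "$device" --namespace-id="$i" --controllers=$(nvme list-ctrl "$device" | awk -F: '{print $2}')
    sleep 2
    nvme get-log "$device" -l 200 -i 4
    sleep 2
done
nvme list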

4> Run mdt.hd on those namespaces.
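
Before starting mdt.hd it can be useful to confirm that the expected number of namespaces (16 in this report) actually showed up as block devices. The following is a minimal sketch, not part of the original report. It assumes namespaces appear as <controller>n<N> nodes (for example /dev/nvme0n1) and that partitions carry a pN suffix. The function name count_namespaces is hypothetical.

```bash
#!/bin/bash
# Count NVMe namespace block-device nodes belonging to one controller.
# $1 = directory to scan (normally /dev), $2 = controller name (e.g. nvme0).
# Assumption: namespaces are named <ctrl>n<digits>; anything else is ignored.
count_namespaces() {
    local dir=$1 ctrl=$2 count=0 node suffix
    for node in "$dir/$ctrl"n*; do
        # Strip everything up to and including "<ctrl>n", leaving the suffix.
        suffix=${node##*/"$ctrl"n}
        case $suffix in
            ''|*[!0-9]*) ;;               # partition (1p1) or unmatched glob (*)
            *) count=$((count + 1)) ;;    # purely numeric suffix: a namespace
        esac
    done
    echo "$count"
}
```

Calling count_namespaces /dev nvme0 after the creation script finishes should print the same number that the script read from the "nn" field.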

Contact Information = <email address hidden>

Stack trace output:
 ---------------------------------------------------------------------

---------------------------------------------------------------------
Device id:/dev/nvme0n8
Timestamp:Feb 20 16:57:30 2018
err=ffffffff
sev=1
Exerciser Name:hxestorage
Serial No:Not Available
Part No:Not Available
Location:Not Available
FRU Number:Not Available
Device:Not Available
Error Text:Hardware Exerciser stopped on error

---------------------------------------------------------------------

---------------------------------------------------------------------
Device id:/dev/nvme0n10
Timestamp:Feb 20 16:57:36 2018
err=ffffffff
sev=1
Exerciser Name:hxestorage
Serial No:Not Available
Part No:Not Available
Location:Not Available
FRU Number:Not Available
Device:Not Available
Error Text:Hung I/O alert! Segment table-1, Detected 1 I/O(s) hung.
Current time: 1519163856; hang criteria: 600 secs, Hard hang threshold: 3
Process ID: 0x8161
        1st lba    Blocks    Kernel          Hang    Duration
        (Hex)      (Hex)     Thread          Cnt     (Secs)
** Threshold of 1800 secs on one or more I/Os exceeded!
        0x5ae08b   8         7e0457eaf180    4       4800

---------------------------------------------------------------------

---------------------------------------------------------------------
Device id:/dev/nvme0n10
Timestamp:Feb 20 16:57:36 2018
err=ffffffff
sev=1
Exerciser Name:hxestorage
Serial No:Not Available
Part No:Not Available
Location:Not Available
FRU Number:Not Available
Device:Not Available
Error Text:Hardware Exerciser stopped on error

---------------------------------------------------------------------

---------------------------------------------------------------------
Device id:/dev/nvme0n4
Timestamp:Feb 20 17:14:19 2018
err=ffffffff
sev=4
Exerciser Name:hxestorage
Serial No:Not Available
Part No:Not Available
Location:Not Available
FRU Number:Not Available
Device:Not Available
Error Text:Hung I/O alert! Segment table-1, Detected 1 I/O(s) hung.
Current time: 1519164859; hang criteria: 600 secs, Hard hang threshold: 3
Process ID: 0x815b
        1st lba    Blocks    Kernel          Hang    Duration
        (Hex)      (Hex)     Thread          Cnt     (Secs)
        0x398a7e   2         71d5affff180    3       3000

---------------------------------------------------------------------
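
To see at a glance which namespaces reported hung I/O, the HTX error entries above can be filtered mechanically. This is a minimal sketch that assumes the log keeps the "Device id:" and "Error Text:" line format shown in this report. The function name hung_devices is hypothetical.

```bash
#!/bin/bash
# Read HTX error-log text on stdin and print each device that logged a
# "Hung I/O alert", once, in the order first seen.
# Assumption: every entry has a "Device id:" line before its "Error Text:" line.
hung_devices() {
    awk '
        /^Device id:/ { dev = substr($0, length("Device id:") + 1) }
        /^Error Text:Hung I\/O alert/ {
            if (!(dev in seen)) { seen[dev] = 1; print dev }
        }
    '
}
```

Fed the four entries above, it lists /dev/nvme0n10 and /dev/nvme0n4 but not /dev/nvme0n8, whose only entry is "Hardware Exerciser stopped on error".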

[17643.202114] INFO: task hxestorage:39744 blocked for more than 120 seconds.
[17643.202180] Not tainted 4.15.0-10-generic #11-Ubuntu
[17643.202224] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17643.202342] hxestorage D 0 39744 3424 0x00040000
[17643.202346] Call Trace:
[17643.202352] [c00020382bc4b660] [c00020382bc4b6b0] 0xc00020382bc4b6b0 (unreliable)
[17643.202360] [c00020382bc4b830] [c00000000001c080] __switch_to+0x2a0/0x4d0
[17643.202364] [c00020382bc4b890] [c000000000cfce84] __schedule+0x2a4/0xaf0
[17643.202366] [c00020382bc4b960] [c000000000cfd710] schedule+0x40/0xc0
[17643.202370] [c00020382bc4b980] [c00000000014dffc] io_schedule+0x2c/0x50
[17643.202376] [c00020382bc4b9b0] [c00000000042bf94] __blkdev_direct_IO_simple+0x1d4/0x3e0
[17643.202379] [c00020382bc4bae0] [c00000000042c500] blkdev_direct_IO+0x360/0x540
[17643.202384] [c00020382bc4bbb0] [c0000000002dc1f8] generic_file_direct_write+0xc8/0x240
[17643.202387] [c00020382bc4bc20] [c0000000002dc47c] __generic_file_write_iter+0x10c/0x2a0
[17643.202391] [c00020382bc4bc80] [c00000000042da3c] blkdev_write_iter+0xac/0x160
[17643.202394] [c00020382bc4bcf0] [c0000000003cc3f4] new_sync_write+0x104/0x160
[17643.202397] [c00020382bc4bd80] [c0000000003cfb38] vfs_write+0xd8/0x220
[17643.202401] [c00020382bc4bdd0] [c0000000003d00b4] SyS_pwrite64+0xc4/0xf0
[17643.202405] [c00020382bc4be30] [c00000000000b184] system_call+0x58/0x6c
[17643.202408] INFO: task hxestorage:39748 blocked for more than 120 seconds.
[17643.202519] Not tainted 4.15.0-10-generic #11-Ubuntu
[17643.202587] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17643.202692] hxestorage D 0 39748 3424 0x00040000
[17643.202695] Call Trace:
[17643.202697] [c00020382bc6f660] [c00020382bc6f6b0] 0xc00020382bc6f6b0 (unreliable)
[17643.202701] [c00020382bc6f830] [c00000000001c080] __switch_to+0x2a0/0x4d0
[17643.202703] [c00020382bc6f890] [c000000000cfce84] __schedule+0x2a4/0xaf0
[17643.202705] [c00020382bc6f960] [c000000000cfd710] schedule+0x40/0xc0
[17643.202708] [c00020382bc6f980] [c00000000014dffc] io_schedule+0x2c/0x50
[17643.202711] [c00020382bc6f9b0] [c00000000042bf94] __blkdev_direct_IO_simple+0x1d4/0x3e0
[17643.202714] [c00020382bc6fae0] [c00000000042c500] blkdev_direct_IO+0x360/0x540
[17643.202717] [c00020382bc6fbb0] [c0000000002dc1f8] generic_file_direct_write+0xc8/0x240
[17643.202720] [c00020382bc6fc20] [c0000000002dc47c] __generic_file_write_iter+0x10c/0x2a0
[17643.202723] [c00020382bc6fc80] [c00000000042da3c] blkdev_write_iter+0xac/0x160
[17643.202726] [c00020382bc6fcf0] [c0000000003cc3f4] new_sync_write+0x104/0x160
[17643.202729] [c00020382bc6fd80] [c0000000003cfb38] vfs_write+0xd8/0x220
[17643.202732] [c00020382bc6fdd0] [c0000000003d00b4] SyS_pwrite64+0xc4/0xf0
[17643.202735] [c00020382bc6fe30] [c00000000000b184] system_call+0x58/0x6c
[17643.202740] INFO: task hxestorage:39917 blocked for more than 120 seconds.
[17643.202809] Not tainted 4.15.0-10-generic #11-Ubuntu
[17643.202882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17643.203013] hxestorage D 0 39917 3424 0x00040000
[17643.203015] Call Trace:
[17643.203017] [c00020382bcd3720] [0000003c00000000] 0x3c00000000 (unreliable)
[17643.203021] [c00020382bcd38f0] [c00000000001c080] __switch_to+0x2a0/0x4d0
[17643.203023] [c00020382bcd3950] [c000000000cfce84] __schedule+0x2a4/0xaf0
[17643.203025] [c00020382bcd3a20] [c000000000cfd710] schedule+0x40/0xc0
[17643.203027] [c00020382bcd3a40] [c00000000014dffc] io_schedule+0x2c/0x50
[17643.203030] [c00020382bcd3a70] [c00000000042bf94] __blkdev_direct_IO_simple+0x1d4/0x3e0
[17643.203033] [c00020382bcd3ba0] [c00000000042c500] blkdev_direct_IO+0x360/0x540
[17643.203036] [c00020382bcd3c70] [c0000000002dbfdc] generic_file_read_iter+0xbc/0x210
[17643.203040] [c00020382bcd3cd0] [c00000000042d1e0] blkdev_read_iter+0x50/0x80
[17643.203043] [c00020382bcd3cf0] [c0000000003cc290] new_sync_read+0x100/0x160
[17643.203046] [c00020382bcd3d80] [c0000000003cf74c] vfs_read+0xbc/0x1b0
[17643.203049] [c00020382bcd3dd0] [c0000000003cffc4] SyS_pread64+0xc4/0xf0
[17643.203052] [c00020382bcd3e30] [c00000000000b184] system_call+0x58/0x6c
[17643.203056] INFO: task hxestorage:40049 blocked for more than 120 seconds.
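
The hung-task warnings above can be reduced to a list of affected tasks when comparing runs. This is a minimal sketch that assumes the kernel message wording shown in this trace ("INFO: task NAME:PID blocked for more than N seconds."). The function name hung_tasks is hypothetical.

```bash
#!/bin/bash
# Read kernel log text on stdin and print each "name:pid" that triggered a
# hung-task warning, once, in the order first seen.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\) blocked for more than [0-9]* seconds\..*/\1/p' |
        awk '!seen[$0]++'
}
```

For example, piping the output of dmesg into hung_tasks on the affected system would list hxestorage:39744, hxestorage:39748, hxestorage:39917 and hxestorage:40049 for the excerpt above.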

Possible patch being reviewed for this issue:

http://linuxppc.10917.n7.nabble.com/PATCH-powerpc-64s-Fix-lost-pending-interrupt-due-to-race-causing-lost-update-to-irq-happened-td135119.html

bugproxy (bugproxy) wrote : sos report

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-164942 severity-high targetmilestone-inin1804
Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
tags: added: kernel-da-key
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → Triaged
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2018-03-23 11:01 EDT-------
Patch accepted upstream in the powerpc tree as git commit
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=fixes&id=ff6781fd1bb404d8a551c02c35c70cec1da17ff1
("powerpc/64s: Fix lost pending interrupt due to race causing lost update to irq_happened")

Frank Heimes (fheimes)
tags: added: triage-g
Changed in linux (Ubuntu Bionic):
status: Triaged → In Progress
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Triaged → In Progress
Seth Forshee (sforshee)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-15.16

---------------
linux (4.15.0-15.16) bionic; urgency=medium

  * linux: 4.15.0-15.16 -proposed tracker (LP: #1761177)

  * FFe: Enable configuring resume offset via sysfs (LP: #1760106)
    - PM / hibernate: Make passing hibernate offsets more friendly

  * /dev/bcache/by-uuid links not created after reboot (LP: #1729145)
    - SAUCE: (no-up) bcache: decouple emitting a cached_dev CHANGE uevent

  * Ubuntu18.04:POWER9:DD2.2 - Unable to start a KVM guest with default machine
    type(pseries-bionic) complaining "KVM implementation does not support
    Transactional Memory, try cap-htm=off" (kvm) (LP: #1752026)
    - powerpc: Use feature bit for RTC presence rather than timebase presence
    - powerpc: Book E: Remove unused CPU_FTR_L2CSR bit
    - powerpc: Free up CPU feature bits on 64-bit machines
    - powerpc: Add CPU feature bits for TM bug workarounds on POWER9 v2.2
    - powerpc/powernv: Provide a way to force a core into SMT4 mode
    - KVM: PPC: Book3S HV: Work around transactional memory bugs in POWER9
    - KVM: PPC: Book3S HV: Work around XER[SO] bug in fake suspend mode
    - KVM: PPC: Book3S HV: Work around TEXASR bug in fake suspend state

  * Important Kernel fixes to be backported for Power9 (kvm) (LP: #1758910)
    - powerpc/mm: Fixup tlbie vs store ordering issue on POWER9

  * Ubuntu 18.04 - IO Hang on some namespaces when running HTX with 16
    namespaces (Bolt / NVMe) (LP: #1757497)
    - powerpc/64s: Fix lost pending interrupt due to race causing lost update to
      irq_happened

  * fwts-efi-runtime-dkms 18.03.00-0ubuntu1: fwts-efi-runtime-dkms kernel module
    failed to build (LP: #1760876)
    - [Packaging] include the retpoline extractor in the headers

linux (4.15.0-14.15) bionic; urgency=medium

  * linux: 4.15.0-14.15 -proposed tracker (LP: #1760678)

  * [Bionic] mlx4 ETH - mlnx_qos failed when set some TC to vendor
    (LP: #1758662)
    - net/mlx4_en: Change default QoS settings

  * AT_BASE_PLATFORM in AUXV is absent on kernels available on Ubuntu 17.10
    (LP: #1759312)
    - powerpc/64s: Fix NULL AT_BASE_PLATFORM when using DT CPU features

  * Bionic update to 4.15.15 stable release (LP: #1760585)
    - net: dsa: Fix dsa_is_user_port() test inversion
    - openvswitch: meter: fix the incorrect calculation of max delta_t
    - qed: Fix MPA unalign flow in case header is split across two packets.
    - tcp: purge write queue upon aborting the connection
    - qed: Fix non TCP packets should be dropped on iWARP ll2 connection
    - sysfs: symlink: export sysfs_create_link_nowarn()
    - net: phy: relax error checking when creating sysfs link netdev->phydev
    - devlink: Remove redundant free on error path
    - macvlan: filter out unsupported feature flags
    - net: ipv6: keep sk status consistent after datagram connect failure
    - ipv6: old_dport should be a __be16 in __ip6_datagram_connect()
    - ipv6: sr: fix NULL pointer dereference when setting encap source address
    - ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
    - mlxsw: spectrum_buffers: Set a minimum quota for CPU port traffic
    - net: phy: Tell caller result ...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released
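
Since the changelog above places the fix in linux 4.15.0-15.16, a system can be checked against that version. This is a minimal sketch that assumes a Bionic 4.15 kernel whose release string looks like the "4.15.0-10-generic" in the uname output of this report. It relies on GNU sort -V and says nothing about other kernel series. The function name has_irq_fix is hypothetical.

```bash
#!/bin/bash
# Return 0 if a Bionic 4.15 kernel release string (as printed by `uname -r`)
# is at or after 4.15.0-15, the first ABI that carries the fix per the changelog.
# Returns 1 if it is older and 2 if the string cannot be parsed.
has_irq_fix() {
    local fixed=4.15.0-15 ver
    # Keep only "<version>-<abi>", dropping a flavour suffix such as "-generic".
    ver=$(printf '%s\n' "$1" | sed -n 's/^\([0-9.]*-[0-9]*\).*/\1/p')
    [ -n "$ver" ] || return 2
    # sort -V orders version strings; the fix is present if $fixed sorts first.
    [ "$(printf '%s\n' "$fixed" "$ver" | sort -V | head -n 1)" = "$fixed" ]
}
```

Running has_irq_fix "$(uname -r)" on the system from this report (4.15.0-10-generic) returns non-zero, which matches the kernel on which the hang was observed.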