Comment 73 for bug 1569925

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Okay, I removed almost everything out of equation:

- removed networking interfaces from systemd
- removed open-iscsi login/logout logic from systemd
- used a single network interface for the logins, on a single iscsi portal

was still able to reproduce the issue by:

- doing a simple login after configuring the network device:
  ./net-start.sh ; sleep 1 ; sudo iscsiadm -m node --login

- shutting down the network device before any logout:
  ./net-stop.sh ; sudo shutdown -h now

There was no systemd service order in play (in between open-iscsi, iscsid and networking / ifupdown scripts) and I was able to cause the same issue 100%, same messages, same symptoms. This tells us that, definitely turning down the network while still logged iscsi sessions is causing the hangs (check #3 for why).

Summary of Causes

1)

network shuts down -> iscsi luns are logged out (attempt) -> iscsid daemon is shutdown
or
network shuts down -> iscsi daemon is shutdown -> iscsi luns are logged out

- logout would fail due to lack of network
- iscsi daemon would die and/or logout wouldn't work
- iscsi sessions would still be there
- system would hang (transport layer can't logout)

2)

iscsi daemon is shutdown -> iscsi luns are logged out -> network shuts down

- this would cause the bug i'm mentioning: daemon dies, logout doesn't work.
- some iscsi sessions would still be there
- system would hang (transport layer can't logout)
- (to check: open-iscsi.service ExecStop= running in parallel ?!?)

3)

I used KVM watchdog virtual device + NMI from host to crash the guest during the hang

http://pastebin.ubuntu.com/25394744/

And, finally, the hang is because kernel is hanged during its shutdown logic.

> 0 0 0 ffffffff81e11500 RU 0.0 0 0 [swapper/0]
> 0 0 1 ffff8801a6a20e00 RU 0.0 0 0 [swapper/1]
> 0 0 2 ffff8801a6a21c00 RU 0.0 0 0 [swapper/2]
> 0 0 3 ffff8801a6a22a00 RU 0.0 0 0 [swapper/3]

ALL CPUs were idling during the hang.

crash> runq
CPU 0 RUNQUEUE: ffff8801ae016e00
  CURRENT: PID: 0 TASK: ffffffff81e11500 COMMAND: "swapper/0"
  RT PRIO_ARRAY: ffff8801ae016fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae016e98
     [no tasks queued]

CPU 1 RUNQUEUE: ffff8801ae096e00
  CURRENT: PID: 0 TASK: ffff8801a6a20e00 COMMAND: "swapper/1"
  RT PRIO_ARRAY: ffff8801ae096fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae096e98
     [no tasks queued]

CPU 2 RUNQUEUE: ffff8801ae116e00
  CURRENT: PID: 0 TASK: ffff8801a6a21c00 COMMAND: "swapper/2"
  RT PRIO_ARRAY: ffff8801ae116fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae116e98
     [no tasks queued]

CPU 3 RUNQUEUE: ffff8801ae196e00
  CURRENT: PID: 0 TASK: ffff8801a6a22a00 COMMAND: "swapper/3"
  RT PRIO_ARRAY: ffff8801ae196fb0
     [no tasks queued]
  CFS RB_ROOT: ffff8801ae196e98
     [no tasks queued]

NO task was scheduled to run.

crash> ps -u
   PID PPID CPU TASK ST %MEM VSZ RSS COMM
      1 0 1 ffff8801a69b8000 UN 0.0 15484 3204 systemd-shutdow

There was just ONE SINGLE user task running (systemd-shutdown)

crash> set ffff8801a69b8000
    PID: 1
COMMAND: "systemd-shutdow"
   TASK: ffff8801a69b8000 [THREAD_INFO: ffff8801a69c0000]
    CPU: 1
  STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 1 TASK: ffff8801a69b8000 CPU: 1 COMMAND: "systemd-shutdow"
 #0 [ffff8801a69c3a30] __schedule at ffffffff8183e9ee
 #1 [ffff8801a69c3a80] schedule at ffffffff8183f0d5
 #2 [ffff8801a69c3a98] schedule_timeout at ffffffff81842199
 #3 [ffff8801a69c3b40] io_schedule_timeout at ffffffff8183e604
 #4 [ffff8801a69c3b70] wait_for_completion_io_timeout at ffffffff8183fc6c
 #5 [ffff8801a69c3bd0] blk_execute_rq at ffffffff813cfe10
 #6 [ffff8801a69c3c88] scsi_execute at ffffffff815c3fc7
 #7 [ffff8801a69c3cc8] scsi_execute_req_flags at ffffffff815c60fe
 #8 [ffff8801a69c3d30] sd_sync_cache at ffffffff815d37d7
 #9 [ffff8801a69c3da8] sd_shutdown at ffffffff815d3c3c
#10 [ffff8801a69c3dc0] device_shutdown at ffffffff8156112c
#11 [ffff8801a69c3df0] kernel_power_off at ffffffff810a32f5
#12 [ffff8801a69c3e00] SYSC_reboot at ffffffff810a34d3
#13 [ffff8801a69c3f40] sys_reboot at ffffffff810a359e
#14 [ffff8801a69c3f50] entry_SYSCALL_64_fastpath at ffffffff818431f2
    RIP: 00007face7d8f856 RSP: 00007ffd4c271d08 RFLAGS: 00000202
    RAX: ffffffffffffffda RBX: 00007ffd4c271240 RCX: 00007face7d8f856
    RDX: 000000004321fedc RSI: 0000000028121969 RDI: fffffffffee1dead
    RBP: 00007ffd4c271230 R8: 0000000000001400 R9: 0000000000000005
    R10: 00007face88d36c0 R11: 0000000000000202 R12: 00007ffd4c2713d0
    R13: 000000b7e6795c83 R14: 00007ffd4c271c30 R15: 0000000000000001
    ORIG_RAX: 00000000000000a9 CS: 0033 SS: 002b

And it called the "kernel_power_off" logic from a system call.

This logic iterates over all devices calling:

if (dev->bus && dev->bus->shutdown) {
    dev->bus->shutdown(dev);
} else if (dev->driver && dev->driver->shutdown) {
    dev->driver->shutdown(dev);
}

This called sd_shutdown, which called sd_sync_cache, which sent a SYNCHRONIZE_CACHE scsi op code to the device in question (one of the iscsi paths). scsi_execute called blk_execute_rq to put the request on device's queue, for execution.

/* Prevent hang_check timer from firing at us during very long I/O */
hang_check = sysctl_hung_task_timeout_secs;
if (hang_check)
    while (!wait_for_completion_io_timeout(&wait, hang_check * (HZ/2)));

The default timeout for that operation is hang_check 120 seconds * 250/2 AND, "wait_for_completion_io_timeout" returns 0, !0 == 1 so.. this will run forever.

So basically kernel is trying o complete SYNCHRONIZE_CACHE operation, the SCSI cmd is already in device's queue and there is no return (for obvious reasons).

Obs 1)
I have tested this with kernel 4.11 with the same symptom. I have to check now: is there any error propagation logic to cancel this so system can shutdown ? We are not getting a SCSI sense message because the "bus" (iscsi/tcp) is gone. Should we hard flush the device's queue ?

Obs 2)
This isn't necessarily related to systemd. IMO this should also be addressed in kernel, allowing a broken transport layer - specially tcp/ip - to allow the shutdown to happen.

Obs 3)
The cause of this can be related to systemd (if there are leftovers) or iscsid daemon (if it died and/or got restarted before logging out. is there a way of keeping history of sessions so it knows what to logout after daemon starts again ?).

I'll get back to this next week, after reading more this dump.