[bionic] fence_scsi not working properly with Pacemaker 1.1.18-2ubuntu1.1

Bug #1866119 reported by Rafael David Tinoco
This bug affects 1 person
Affects             | Status       | Importance | Assigned to | Milestone
pacemaker (Ubuntu)  | Fix Released | Undecided  | Unassigned  |
Bionic              | Fix Released | Undecided  | Unassigned  |

Bug Description

Note: This bug was originally part of LP: #1865523 but was split out.

#### SRU: pacemaker

[Impact]

 * fence_scsi is currently not working in a shared-disk environment.

 * All clusters relying on fence_scsi and/or fence_scsi + watchdog won't be able to start the fencing agents or, in the worst case, the fence_scsi agent might start but won't make SCSI reservations on the shared SCSI disk.

 * This bug covers the pacemaker 1.1.18 issues with fence_scsi; the fence_scsi agent itself was fixed in LP: #1865523.

[Test Case]

 * With a 3-node setup (nodes "clubionic01", "clubionic02" and "clubionic03"), a shared SCSI disk /dev/sda that fully supports persistent reservations, and corosync and pacemaker up and running, try:

rafaeldtinoco@clubionic01:~$ crm configure
crm(live)configure# property stonith-enabled=on
crm(live)configure# property stonith-action=off
crm(live)configure# property no-quorum-policy=stop
crm(live)configure# property have-watchdog=true
crm(live)configure# commit
crm(live)configure# end
crm(live)# end

rafaeldtinoco@clubionic01:~$ crm configure primitive fence_clubionic \
    stonith:fence_scsi params \
    pcmk_host_list="clubionic01 clubionic02 clubionic03" \
    devices="/dev/sda" \
    meta provides=unfencing

And see the following errors:

Failed Actions:
* fence_clubionic_start_0 on clubionic02 'unknown error' (1): call=6, status=Error, exitreason='',
    last-rc-change='Wed Mar 4 19:53:12 2020', queued=0ms, exec=1105ms
* fence_clubionic_start_0 on clubionic03 'unknown error' (1): call=6, status=Error, exitreason='',
    last-rc-change='Wed Mar 4 19:53:13 2020', queued=0ms, exec=1109ms
* fence_clubionic_start_0 on clubionic01 'unknown error' (1): call=6, status=Error, exitreason='',
    last-rc-change='Wed Mar 4 19:53:11 2020', queued=0ms, exec=1108ms

and corosync.log will show:

warning: unpack_rsc_op_failure: Processing failed op start for fence_clubionic on clubionic01: unknown error (1)
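
When repeating the test case on several nodes, it can help to pull out just the node names on which the start operation failed. The following is a minimal sketch, not part of the original report: `failed_start_nodes` is a hypothetical helper name, and it assumes the "Failed Actions" lines have the same shape as the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report).
# Reads crm_mon-style "Failed Actions" text on stdin and prints the node
# name of every failed start operation for the resource given as $1.
# Assumes lines shaped like:
#   * <resource>_start_0 on <node> 'unknown error' (1): ...
failed_start_nodes() {
    awk -v op="${1}_start_0" \
        '$1 == "*" && $2 == op && $3 == "on" { print $4 }'
}
```

A possible invocation would be `sudo crm_mon -1 | failed_start_nodes fence_clubionic`; with the fix applied, the expectation is that it prints nothing.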

[Regression Potential]

 * LP: #1865523 shows fence_scsi fully operational once the SRU for that bug is done.

 * LP: #1865523 used pacemaker 1.1.19 (vanilla) in order to fix fence_scsi.

 * There are changes to the cluster resource manager daemon, the local resource manager daemon and the policy engine. Of these, the policy engine fix is the largest, but it is still small for an SRU. A regression here could cause the policy engine, and therefore cluster decisions, to malfunction.

 * All patches are based on upstream fixes made right after Pacemaker 1.1.18 (the version used by Ubuntu Bionic) and were tested with fence_scsi to confirm that they fix the issues.

[Other Info]

 * Original Description:

While trying to set up a cluster with an iSCSI shared disk, using fence_scsi as the fencing mechanism, I realized that fence_scsi is not working in Ubuntu Bionic. I first thought it was related to the Azure environment (LP: #1864419), where I was testing this setup, but after trying locally I found that pacemaker 1.1.18 is not fencing the shared SCSI disk properly.

Note: I was able to "backport" vanilla 1.1.19 from upstream and fence_scsi worked. I then tried 1.1.18 without any of the quilt patches and it did not work either. I think that bisecting 1.1.18 <-> 1.1.19 might tell us which commit fixed the behaviour needed by the fence_scsi agent.

(k)rafaeldtinoco@clubionic01:~$ crm conf show
node 1: clubionic01.private
node 2: clubionic02.private
node 3: clubionic03.private
primitive fence_clubionic stonith:fence_scsi \
        params pcmk_host_list="10.250.3.10 10.250.3.11 10.250.3.12" devices="/dev/sda" \
        meta provides=unfencing
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.18-2b07d5c5a9 \
        cluster-infrastructure=corosync \
        cluster-name=clubionic \
        stonith-enabled=on \
        stonith-action=off \
        no-quorum-policy=stop \
        symmetric-cluster=true

----

(k)rafaeldtinoco@clubionic02:~$ sudo crm_mon -1
Stack: corosync
Current DC: clubionic01.private (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon Mar 2 15:55:30 2020
Last change: Mon Mar 2 15:45:33 2020 by root via cibadmin on clubionic01.private

3 nodes configured
1 resource configured

Online: [ clubionic01.private clubionic02.private clubionic03.private ]

Active resources:

 fence_clubionic (stonith:fence_scsi): Started clubionic01.private

----

(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist --in --read-keys --device=/dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there are NO registered reservation keys

(k)rafaeldtinoco@clubionic02:~$ sudo sg_persist -r /dev/sda
  LIO-ORG cluster.bionic. 4.0
  Peripheral device type: disk
  PR generation=0x0, there is NO reservation held
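
The key check above can be scripted. The following is a minimal sketch, not part of the original report: `count_registered_keys` is a hypothetical helper name, and it assumes that sg_persist lists each registered key on its own line as a hexadecimal value such as `0x3abe0000` (no such line appears in the failing output above).

```shell
#!/bin/sh
# Hypothetical helper (not from the bug report).
# Reads `sg_persist --in --read-keys` output on stdin and prints how many
# reservation keys are registered.
# Assumption: each key is printed alone on a line as a 0x... hex value.
count_registered_keys() {
    # grep -c prints 0 but exits non-zero when nothing matches;
    # "|| true" keeps the helper usable under `set -e`.
    grep -cE '^[[:space:]]*0x[0-9a-fA-F]+[[:space:]]*$' || true
}
```

A possible invocation would be `sudo sg_persist --in --read-keys --device=/dev/sda | count_registered_keys`. For the broken state shown above it prints 0; on a cluster where unfencing works, a non-zero count would be expected.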

Changed in pacemaker (Ubuntu):
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
status: New → Fix Released
Changed in pacemaker (Ubuntu Bionic):
status: New → Confirmed
description: updated
Changed in pacemaker (Ubuntu Bionic):
status: Confirmed → In Progress
Rafael David Tinoco (rafaeldtinoco) wrote :

A PPA can currently be found at: https://launchpad.net/~ubuntu-server-ha/+archive/ubuntu/staging

I'm adjusting the SRU but, meanwhile, that PPA provides a working version for Ubuntu Bionic.

Rafael David Tinoco (rafaeldtinoco) wrote :
summary: - [bionic] fence_scsi not working properly with 1.1.18-2ubuntu1.1
+ [bionic] fence_scsi not working properly with Pacemaker
+ 1.1.18-2ubuntu1.1
description: updated
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Rafael, or anyone else affected,

Accepted pacemaker into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/1.1.18-0ubuntu1.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in pacemaker (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
tags: added: block-proposed-bionic
Rafael David Tinoco (rafaeldtinoco) wrote :

Okay, I have been verifying this since the day it landed in -proposed. It is working as expected (https://discourse.ubuntu.com/t/ubuntu-high-availability-corosync-pacemaker-shared-disk-environments/). I'm marking this as verification-done, as it has stayed in -proposed for some time now and no bad feedback was given by those who were asked to test it.

tags: added: verification-done verification-done-bionic
removed: block-proposed-bionic verification-needed verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.18-0ubuntu1.2

---------------
pacemaker (1.1.18-0ubuntu1.2) bionic; urgency=medium

  * Pacemaker fixes to allow fence-agents to work correctly (LP: #1866119)
    - d/p/lp1866119-Fix-crmd-avoid-double-free.patch: fix double free
      causing intermittent errors
    - d/p/lp1866119-Fix-attrd-ensure-node-name-is-broadcast.patch: fix
      hang on shutdown issue.
    - d/p/lp1866119-Refactor-pengine-functionize.patch: small needed delta
      to allow the unfence fix.
    - d/p/lp1866119-Fix-pengine-unfence-before-probing.patch: allows
      fence-agents to start correctly (LP #1865523)

 -- Rafael David Tinoco <email address hidden> Fri, 06 Mar 2020 02:28:20 +0000

Changed in pacemaker (Ubuntu Bionic):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for pacemaker has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
