Pacemaker "crm node standby" stops resource successfully, but lrmd still monitors it and causes "Failed actions"

Bug #1353473 reported by Rafael David Tinoco
This bug affects 1 person
Affects              Status         Importance   Assigned to   Milestone
pacemaker (Ubuntu)   Fix Released   Undecided    Unassigned
  Trusty             Fix Released   Undecided    Unassigned

Bug Description

[Impact]

 * Whenever a user runs "crm node standby", lrmd can keep trying to
   monitor resources that were stopped on the standby node, which
   produces "Failed actions" error messages.

[Test Case]

 * Run "crm node standby" and check that lrmd stops monitoring
   resources on the node put into standby.
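
The check in this test case can be sketched as a small script over the text printed by "crm_mon -1". This is only a sketch: the function names are hypothetical, the regular expressions assume the output layout quoted later in this report (a "Node <name> (<id>): standby" line and "<rsc>_monitor_<interval> (node=<name>, ..." entries under "Failed actions"), and the sample is abbreviated from that output.

```python
import re

# Abbreviated from the crm_mon output quoted in this report.
SAMPLE = """\
Node A1LB101 (167969461): standby
Online: [ A1LB102 ]

Failed actions:
    haproxy_monitor_10000 (node=A1LB101, call=2332, rc=7, status=complete): not running
"""


def standby_nodes(crm_mon_text):
    """Return the set of node names that crm_mon reports as standby."""
    pattern = r"^\s*Node (\S+)(?: \(\d+\))?: standby"
    return set(re.findall(pattern, crm_mon_text, flags=re.M))


def failed_monitors_on_standby(crm_mon_text):
    """Return (operation, node) pairs for failed monitors on standby nodes."""
    standby = standby_nodes(crm_mon_text)
    # Assumed entry layout: "<rsc>_monitor_<interval> (node=<name>, ..."
    failed = re.findall(r"(\S+_monitor_\d+) \(node=([^,\s]+),", crm_mon_text)
    return [(op, node) for op, node in failed if node in standby]


if __name__ == "__main__":
    print(failed_monitors_on_standby(SAMPLE))
```

An empty list after "crm node standby" would suggest that no monitor failed on a standby node; a non-empty list matches the symptom described in this bug.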

[Regression Potential]

 * users have already tested the fix and are running it in production.
 * based on upstream fixes for lrmd monitoring.
 * potential race conditions (based on upstream history).

[Other Info]

 * Original bug description:

----------------

The following situation was brought to me (~inaddy):

""""""

* Environment
Ubuntu 14.04 LTS
Pacemaker 1.1.10+git20130802-1ubuntu2

* Priority
High

* Issue
I used "crm node standby" and the resource (haproxy) was stopped successfully. But lrmd still monitors it and causes "Failed actions".

---------------------------------------
Node A1LB101 (167969461): standby
Online: [ A1LB102 ]

Resource Group: grpHaproxy
vip-internal (ocf::heartbeat:IPaddr2): Started A1LB102
vip-external (ocf::heartbeat:IPaddr2): Started A1LB102
vip-nfs (ocf::heartbeat:IPaddr2): Started A1LB102
vip-iscsi (ocf::heartbeat:IPaddr2): Started A1LB102
Resource Group: grpStonith1
prmStonith1-1 (stonith:external/stonith-helper): Started A1LB102
Clone Set: clnHaproxy [haproxy]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]
Clone Set: clnPing [ping]
Started: [ A1LB102 ]
Stopped: [ A1LB101 ]

Node Attributes:
* Node A1LB101:
* Node A1LB102:
+ default_ping_set : 400

Migration summary:
* Node A1LB101:
haproxy: migration-threshold=1 fail-count=18 last-failure='Mon Jul 7 20:28:58 2014'
* Node A1LB102:

Failed actions:
haproxy_monitor_10000 (node=A1LB101, call=2332, rc=7, status=complete, last-rc-change=Mon Jul 7 20:28:58 2014, queued=0ms, exec=0ms): not running
---------------------------------------

Excerpt from log (ha-log.node1)
Jul 7 20:28:50 A1LB101 crmd[6364]: notice: te_rsc_command: Initiating action 42: stop haproxy_stop_0 on A1LB101 (local)
Jul 7 20:28:50 A1LB101 crmd[6364]: info: match_graph_event: Action haproxy_stop_0 (42) confirmed on A1LB101 (rc=0)
Jul 7 20:28:58 A1LB101 crmd[6364]: notice: process_lrm_event: A1LB101-haproxy_monitor_10000:1372 [ haproxy not running.\n ]

""""""

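The pattern in the log excerpt above (a stop confirmed with rc=0, followed by a monitor result for the same resource on the same node) can be searched for mechanically. The following is a rough sketch, not a tool that ships with Pacemaker: the function name is hypothetical, the regular expressions are modelled only on the three log lines quoted above, the assumption that a confirmed start is logged in the same form as a confirmed stop is mine, and node names containing a hyphen would be split incorrectly.

```python
import re

# Log lines copied from the excerpt quoted in this report.
LOG = [
    r"Jul 7 20:28:50 A1LB101 crmd[6364]: notice: te_rsc_command: Initiating action 42: stop haproxy_stop_0 on A1LB101 (local)",
    r"Jul 7 20:28:50 A1LB101 crmd[6364]: info: match_graph_event: Action haproxy_stop_0 (42) confirmed on A1LB101 (rc=0)",
    r"Jul 7 20:28:58 A1LB101 crmd[6364]: notice: process_lrm_event: A1LB101-haproxy_monitor_10000:1372 [ haproxy not running.\n ]",
]

# Confirmed stop/start of a resource on a node (start form is an assumption).
STATE_RE = re.compile(
    r"match_graph_event: Action (\S+)_(stop|start)_0 \(\d+\) confirmed on (\S+) \(rc=0\)"
)
# Recurring monitor result reported as "<node>-<rsc>_monitor_<interval>:<call>".
MONITOR_RE = re.compile(r"process_lrm_event: (\S+?)-(\S+)_monitor_(\d+):\d+")


def monitors_after_stop(lines):
    """Return monitor log lines seen after a confirmed stop of the same resource."""
    stopped = set()  # (node, resource) pairs whose stop was confirmed
    hits = []
    for line in lines:
        state = STATE_RE.search(line)
        if state:
            rsc, action, node = state.groups()
            if action == "stop":
                stopped.add((node, rsc))
            else:
                stopped.discard((node, rsc))
            continue
        mon = MONITOR_RE.search(line)
        if mon and (mon.group(1), mon.group(2)) in stopped:
            hits.append(line)
    return hits


if __name__ == "__main__":
    for hit in monitors_after_stop(LOG):
        print(hit)
```

Any line returned here is a monitor that fired after the stop was confirmed, which is the behaviour this bug describes.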
I wasn't able to reproduce this error so far, but the fix seems to be a straightforward cherry-pick of this upstream patch set:

48f90f6 Fix: services: Do not allow duplicate recurring op entries
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed

So I'm assuming (and testing right now) that this will fix the issue. I'm opening this public bug for the fix I'll provide after testing, and to ask others to test the fix as well.
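
To illustrate why the three commits listed above would plausibly address the symptom, here is a toy model written from the commit titles alone. It is not Pacemaker's lrmd code: the class, its methods and the "cancel_before_stop" switch are all hypothetical, and the real fix may work differently in detail.

```python
class ToyExecutor:
    """Toy stand-in for a resource executor; NOT Pacemaker's lrmd."""

    def __init__(self, cancel_before_stop=True):
        self.cancel_before_stop = cancel_before_stop
        self.running = set()   # resources currently started
        self.recurring = {}    # (resource, op, interval_ms) -> call id
        self.next_call = 1
        self.failures = []

    def start(self, rsc):
        self.running.add(rsc)

    def schedule_monitor(self, rsc, interval_ms):
        """Register a recurring monitor, merging duplicates into one entry."""
        key = (rsc, "monitor", interval_ms)
        if key in self.recurring:
            return self.recurring[key]
        call_id = self.next_call
        self.next_call += 1
        self.recurring[key] = call_id
        return call_id

    def stop(self, rsc):
        """Stop a resource, optionally cancelling its recurring ops first."""
        if self.cancel_before_stop:
            for key in [k for k in self.recurring if k[0] == rsc]:
                del self.recurring[key]
        self.running.discard(rsc)

    def tick(self):
        """Fire every recurring monitor once and record failures."""
        for rsc, op, interval_ms in self.recurring:
            if rsc not in self.running:
                self.failures.append(f"{rsc}_{op}_{interval_ms}: not running")


if __name__ == "__main__":
    for cancel in (False, True):
        ex = ToyExecutor(cancel_before_stop=cancel)
        ex.start("haproxy")
        ex.schedule_monitor("haproxy", 10000)
        ex.stop("haproxy")
        ex.tick()
        print(cancel, ex.failures)
```

With cancellation disabled, the leftover monitor fires against a stopped resource and records a "not running" failure, which resembles the "Failed actions" entry in the report; with cancellation enabled, nothing is left to fire.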

Changed in pacemaker (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → Confirmed
description: updated
Rafael David Tinoco (rafaeldtinoco) wrote :

## After applying the fix I could successfully put one node on standby. Resources migrated correctly.

root@trustycluster02:~# crm_mon
Connection to the CIB terminated
Reconnecting...
root@trustycluster02:~# crm_mon -1
Last updated: Wed Aug 6 10:27:35 2014
Last change: Tue Aug 5 15:42:11 2014 via crm_attribute on trustycluster04
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured

Node trustycluster01 (739246087): standby
Online: [ trustycluster02 trustycluster03 trustycluster04 ]

 p_fence_cluster01 (stonith:external/vcenter): Started trustycluster02
 p_fence_cluster02 (stonith:external/vcenter): Started trustycluster03
 p_fence_cluster03 (stonith:external/vcenter): Started trustycluster04
 p_fence_cluster04 (stonith:external/vcenter): Started trustycluster02
 clusterip (ocf::heartbeat:IPaddr2): Started trustycluster03

## and resources were active on other nodes:

root@trustycluster01:~# crm_mon -1
Last updated: Wed Aug 6 10:29:48 2014
Last change: Wed Aug 6 10:27:47 2014 via crm_attribute on trustycluster01
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured

Node trustycluster01 (739246087): standby
Node trustycluster03 (739246089): standby
Online: [ trustycluster02 trustycluster04 ]

 p_fence_cluster01 (stonith:external/vcenter): Started trustycluster02
 p_fence_cluster02 (stonith:external/vcenter): Started trustycluster04
 p_fence_cluster03 (stonith:external/vcenter): Started trustycluster04
 p_fence_cluster04 (stonith:external/vcenter): Started trustycluster02
 clusterip (ocf::heartbeat:IPaddr2): Started trustycluster02

## After putting nodes back online:

root@trustycluster01:~# crm_mon -1
Last updated: Wed Aug 6 10:30:42 2014
Last change: Wed Aug 6 10:30:36 2014 via crm_attribute on trustycluster01
Stack: corosync
Current DC: trustycluster02 (739246088) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
5 Resources configured

Online: [ trustycluster01 trustycluster02 trustycluster03 trustycluster04 ]

 p_fence_cluster01 (stonith:external/vcenter): Started trustycluster02
 p_fence_cluster02 (stonith:external/vcenter): Started trustycluster04
 p_fence_cluster03 (stonith:external/vcenter): Started trustycluster01
 clusterip (ocf::heartbeat:IPaddr2): Started trustycluster01

Rafael David Tinoco (rafaeldtinoco) wrote :

I created a public PPA so the SRU proposal can be tested before asking for sponsorship:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1353473

# apt-add-repository ppa:inaddy/lp1353473
# apt-get update
# apt-get dist-upgrade

* attention: this will replace the current trusty pacemaker version:
pacemaker_1.1.10+git20130802-1ubuntu2

* with version:
pacemaker_1.1.10+git20130802-1ubuntu3

* because the versioning is already prepared for the SRU proposal.
* to get back to the current trusty version you will have to remove
* pacemaker by hand and install it again (maybe ignoring
* dependencies if you don't want to reinstall the whole set of
* clustering packages).

After upgrading to version pacemaker_1.1.10+git20130802-1ubuntu3,

anyone who is suffering from this issue can try
# crm node standby <node>
again and check whether lrmd stops monitoring resources on nodes put into standby.

Tks

description: updated
Rafael David Tinoco (rafaeldtinoco) wrote :

<email address hidden>:/bugs/00070403/sources/upstream$ git tag --contains 48f90f6
<email address hidden>:/bugs/00070403/sources/upstream$ git tag --contains c29ab27
<email address hidden>:/bugs/00070403/sources/upstream$ git tag --contains 348bb51

Pacemaker-1.1.12
Pacemaker-1.1.12-rc1
Pacemaker-1.1.12-rc2
Pacemaker-1.1.12-rc3
Pacemaker-1.1.12-rc4

Affects Trusty and Utopic.

description: updated
Rafael David Tinoco (rafaeldtinoco) wrote :

Attaching Trusty fix.

Rafael David Tinoco (rafaeldtinoco) wrote :

Attaching Utopic fix.

Rafael David Tinoco (rafaeldtinoco) wrote :
summary: - Trusty Pacemaker "crm node standby" stops resource successfully, but
- lrmd still monitors it and causes "Failed actions"
+ Pacemaker "crm node standby" stops resource successfully, but lrmd still
+ monitors it and causes "Failed actions"
Rafael David Tinoco (rafaeldtinoco) wrote :

Submitted fix to Debian:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757514

Waiting for fix/merge.

Changed in pacemaker (Debian):
status: Unknown → New
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.10+git20130802-4ubuntu3

---------------
pacemaker (1.1.10+git20130802-4ubuntu3) utopic; urgency=medium

  * Fix: services: Do not allow duplicate recurring op entries - 1/3 (LP: #1353473)
  * High: lrmd: Merge duplicate recurring monitor operations - 2/3 (LP: #1353473)
  * Fix: lrmd: Cancel recurring operations before stop action is executed - 3/3 (LP: #1353473)
 -- Rafael David Tinoco <email address hidden> Thu, 04 Sep 2014 09:58:36 -0500

Changed in pacemaker (Ubuntu):
status: Confirmed → Fix Released
Marc Deslauriers (mdeslaur) wrote :

ACK on the debdiff for trusty. I've uploaded it for processing by the SRU team with a slight change in the version number.

Thanks!

Changed in pacemaker (Ubuntu Trusty):
status: New → In Progress
Chris J Arges (arges) wrote : Please test proposed package

Hello Rafael, or anyone else affected,

Accepted pacemaker into trusty-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in pacemaker (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Nobuto Murata (nobuto) wrote :

It works well in fresh OpenStack deployment scenarios.
 * distro repository: "Failed actions" is observed with `crm node standby`
 * -proposed repository: no "Failed actions" with the same operation

I will try to double-check it in a package upgrade scenario if I have time, but the proposed package works as expected so far.

Nobuto Murata (nobuto) wrote :

FYI, I used the attached juju bundle to prepare the environment for the testing described in my last comment.

Nobuto Murata (nobuto) wrote :

Also verified with an upgrade scenario. "Failed actions" is no longer reproducible after upgrading the packages to -proposed.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.10+git20130802-1ubuntu2.1

---------------
pacemaker (1.1.10+git20130802-1ubuntu2.1) trusty; urgency=medium

  * Fix: services: Do not allow duplicate recurring op entries - 1/3 (LP: #1353473)
  * High: lrmd: Merge duplicate recurring monitor operations - 2/3 (LP: #1353473)
  * Fix: lrmd: Cancel recurring operations before stop action is executed - 3/3 (LP: #1353473)
 -- Rafael David Tinoco <email address hidden> Wed, 06 Aug 2014 09:24:13 -0300

Changed in pacemaker (Ubuntu Trusty):
status: Fix Committed → Fix Released
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for pacemaker has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

tags: added: cts
Changed in pacemaker (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → nobody
no longer affects: pacemaker (Debian)