ipmi - Power management trips over itself when BMC doesn't properly support --cycle --on-if-off

Bug #1609496 reported by Nathaniel W. Turner
This bug affects 6 people
Affects   Status        Importance  Assigned to        Milestone
MAAS      Fix Released  Wishlist    Andres Rodriguez
2.2       Fix Released  High        Andres Rodriguez

Bug Description

I seem to have at least one system with a BMC that handles ('/usr/sbin/ipmipower', '-W', 'opensesspriv', '--driver-type', 'LAN_2_0', '-h', '1.2.3.4', '-u', 'maas', '-p', 'xxxxxxxxx', '--cycle', '--on-if-off') poorly. Based on its name, I'd expect that command to have no effect if the power is already on, and to power the node on if it's off. This node doesn't handle it that way.

What happens is that if the power is on when this command is issued, it cycles off for a bit, then back on. The period of this cycle is just beyond the max delay between the command and the --stat check in maas.drivers.power (it usually takes this node about 15 seconds for power to settle back into the "on" state; the max delay is 12 seconds).

So MAAS issues --cycle --on-if-off, waits 4 seconds, then issues --stat and sees that power is off.
It then repeats the sequence with an 8-second wait, and again with a 12-second wait. Each --cycle pushes the time at which the node will be 'on' out by about 15 seconds, so MAAS never sees the power transition to on and concludes it failed to power on the node.
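
To make the sequence concrete, here is a rough Python sketch of the pattern described above. This is not the actual MAAS driver code; issue_ipmipower() and the hard-coded connection arguments are illustrative only.

# Sketch only -- not the real MAAS driver. issue_ipmipower() and the
# connection arguments below are illustrative.
import subprocess
import time

IPMIPOWER = "/usr/sbin/ipmipower"
COMMON_ARGS = ["-W", "opensesspriv", "--driver-type", "LAN_2_0",
               "-h", "1.2.3.4", "-u", "maas", "-p", "xxxxxxxxx"]

def issue_ipmipower(*extra_args):
    """Run ipmipower with the common connection arguments."""
    result = subprocess.run([IPMIPOWER, *COMMON_ARGS, *extra_args],
                            capture_output=True, text=True)
    return result.stdout

def power_on_with_retries(wait_times=(4, 8, 12)):
    """Cycle, wait, then check --stat -- once per entry in wait_times."""
    for wait in wait_times:
        issue_ipmipower("--cycle", "--on-if-off")
        time.sleep(wait)
        if "on" in issue_ipmipower("--stat"):
            return True
        # On the problematic BMC the node is still mid-cycle at this point
        # (it takes about 15 seconds to settle), so the next iteration
        # re-cycles it and pushes the "on" state out again.
    return False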

It seems like this is a defect (or perhaps simply a missing feature?) in this server's BMC firmware. However, I suspect this may not be uncommon in the wild, and perhaps MAAS could be more resilient in the face of such behavior.

One simple workaround might be to increase the final delay time in IPMIPowerDriver.wait_times to be fairly large; for example wait_times = (4, 8, 24). There are probably more elegant ways to deal with this, though, perhaps with more frequent --stat checks between --cycle commands, so the full wait isn't taken if the BMC responds quickly.
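
One possible shape for that "poll --stat more often" idea, as a hedged sketch only (the helper name query_stat and the timeout/poll values are assumptions, not taken from MAAS):

# Sketch of polling --stat at short intervals instead of sleeping once
# per retry; names and values are assumptions, not MAAS code.
import time

def wait_for_power_on(query_stat, timeout=24, poll_interval=2):
    """Return True as soon as the node reports "on", else False on timeout.

    query_stat is assumed to be a callable that runs `ipmipower --stat`
    and returns True when the node reports "on".
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if query_stat():
            return True
        time.sleep(poll_interval)
    return False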


--
I can attach some info about this specific BMC if that's helpful --- what's the best way to obtain this? The commissioning output doesn't seem to have anything that looks relevant.

nturner@maas1:/usr/lib/python3/dist-packages/provisioningserver/drivers/power$ dpkg -l '*maas*'|cat
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-==================================-============-=================================================
ii maas 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS server common files
ii maas-dhcp 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS DHCP server
ii maas-dns 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS DNS server
ii maas-proxy 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.0.0~rc3+bzr5180-0ubuntu2~16.04.1 all MAAS server provisioning libraries (Python 3)
nturner@maas1:/usr/lib/python3/dist-packages/provisioningserver/drivers/power$

Tags: ipmi papercutt

Changed in maas:
importance: Undecided → Wishlist
milestone: none → 2.1.0
status: New → Triaged
Changed in maas:
milestone: 2.0.1 → 2.1.0
Nathaniel W. Turner (nturner) wrote:

In the short term, would it be possible to increase the final IPMIPowerDriver.wait_times value?

Something like (4, 8, 16) isn't a huge change, would probably be good enough for many systems, and has a nice exponential shape to it...
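
For illustration, the tweak would look something like the stand-in below; this only shows the shape of the change and is not the real class definition (the actual driver lives under provisioningserver/drivers/power).

# Illustrative stand-in -- wait_times is assumed to be the tuple of
# seconds slept before each --stat check, as described above.
class IPMIPowerDriverSketch:
    wait_times = (4, 8, 16)  # was (4, 8, 12); only the final wait grows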

Changed in maas:
milestone: 2.1.0 → 2.1.1
Changed in maas:
milestone: 2.1.1 → 2.1.2
Changed in maas:
milestone: 2.1.2 → 2.1.3
Changed in maas:
milestone: 2.1.3 → 2.2.0
tags: added: papercutt
Nathaniel W. Turner (nturner) wrote:

For the benefit of anyone else affected by this, I've attached a simple Ansible role to work around this issue for now.

Changed in maas:
milestone: 2.2.0 → 2.2.x
Changed in maas:
milestone: 2.2.x → 2.3.0
summary: - Power management trips over itself when BMC doesn't properly support
- --cycle --on-if-off
+ ipmi - Power management trips over itself when BMC doesn't properly
+ support --cycle --on-if-off
tags: added: ipmi
Changed in maas:
assignee: nobody → Andres Rodriguez (andreserl)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.3.0 → 2.3.0alpha1
Changed in maas:
status: Fix Committed → Fix Released
TripleO (ooyekanmi) wrote:

See my comment at https://bugs.launchpad.net/maas/+bug/1703671/comments/3 for a different scenario that can manifest a similar commissioning-failure problem.
