IPMI driver does not handle timeouts correctly

Bug #1508741 reported by Andres Rodriguez
This bug affects 6 people
Affects  Status        Importance  Assigned to    Milestone
MAAS     Fix Released  Critical    Newell Jensen
2.0      Fix Released  Critical    Gavin Panella

Bug Description

MAAS is unable to reach the BMCs. However instead of correctly handling
the timeout, it is erroring out. The machines should show a RED for
power error in the UI, but this is not the case. They still show the
status before this issue (which was on). The machines are actually off.

Changed in maas:
milestone: none → 1.9.0
importance: Undecided → Critical
Gavin Panella (allenap)
summary: - MAAS ipmi power does not handle timeouts correclt
+ IPMI driver does not handle timeouts correctly
description: updated
description: updated
Gavin Panella (allenap) wrote :

A red power icon in the UI denotes that the power status could not be *queried*. This failure arises when trying to *change* the power state.

If the power change that fails was initiated as a result of a status change, e.g. when starting commissioning, the node should be transitioned to the corresponding failed state, e.g. failed commissioning. That will happen when the commissioning monitor times out, but it may be that we can detect this failure sooner (or already do).

If the power change that fails was initiated by the user then the user is expected to see that the node did not power on or off and then investigate. A log entry is added to the node event log that will help.

This is not a bug.

Changed in maas:
status: New → Invalid
Andres Rodriguez (andreserl) wrote :

Right, but that's not the problem.

This is what's happening:

1. The machine is powered off.
2. MAAS is showing the machine as ON
3. MAAS queries the power state
4. MAAS fails to query power because of a connection timeout
5. MAAS still shows the machine as ON.

What does this mean? It means that MAAS thinks it can query the BMC when it cannot. Failing to query the BMC is an error, so the power state should be shown in red because MAAS cannot query the BMC.
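
The sequence above can be modelled with a small sketch. Everything in it is hypothetical (the `Machine` class, `refresh_power_state`, `BMCTimeout` and the state strings are illustrative stand-ins, not MAAS code); it only contrasts keeping the last known state with recording an error when the query times out.

```python
from dataclasses import dataclass


class BMCTimeout(Exception):
    """Hypothetical stand-in for a connection timeout while querying a BMC."""


@dataclass
class Machine:
    # Last power state recorded by the (hypothetical) controller.
    power_state: str = "on"


def unreachable_bmc():
    """Simulate a BMC that never answers."""
    raise BMCTimeout("connection timed out")


def refresh_power_state(machine, query, mark_error_on_timeout):
    """Query the BMC and update the recorded power state."""
    try:
        machine.power_state = query()
    except BMCTimeout:
        if mark_error_on_timeout:
            # Expected behaviour: the failed query is surfaced as an error.
            machine.power_state = "error"
        # Otherwise the stale state is kept, which is the reported behaviour.
    return machine.power_state


reported = refresh_power_state(Machine(), unreachable_bmc, mark_error_on_timeout=False)
expected = refresh_power_state(Machine(), unreachable_bmc, mark_error_on_timeout=True)
print(reported, expected)  # prints "on error"
```

With the flag off, the machine keeps reporting "on" even though the BMC could not be reached, which matches steps 2 to 5.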

Changed in maas:
status: Invalid → New
status: New → Triaged
Andres Rodriguez (andreserl) wrote :

Note that the power off that I'm attempting to do here is a manual power off via the webui because MAAS reports the node as being on.

Gavin Panella (allenap) wrote :

> Right, but that's not the problem.
>
> This is what's happening:
>
> 1. The machine is powered off.
> 2. MAAS is showing the machine as ON
> 3. MAAS queries the power state
> 4. MAAS fails to query power because of a connection timeout
> 5. MAAS still shows the machine as ON.

The traceback in the description is misleading then, because it is
related only to changing the power state. If you have a traceback that
relates to a query that would be useful. In the meantime I will edit
that traceback from the description.

description: updated
Gavin Panella (allenap) wrote :

I wonder if the UI calls MAAS's Web API rather than going via the
WebSocket API for this? If so, I can see an inconsistency in
maasserver.api.nodes:

class NodeHandler(OperationsHandler):
    ...
    @operation(idempotent=True)
    def query_power_state(self, request, system_id):
        ...
        call = client(
            PowerQuery, system_id=system_id, hostname=node.hostname,
            power_type=power_info.power_type,
            context=power_info.power_parameters)
        try:
            state = call.wait(POWER_QUERY_TIMEOUT)
        except crochet.TimeoutError:
            maaslog.error(
                "%s: Timed out waiting for power response in Node.power_state",
                node.hostname)
            raise PowerProblem("Timed out waiting for power response")  # <--
        except PowerActionFail as e:
            addTask(node.update_power_state, POWER_STATE.ERROR)
            raise PowerProblem(e)
        except NotImplementedError as e:
            addTask(node.update_power_state, POWER_STATE.UNKNOWN)
            raise PowerProblem(e)

Other call-sites for the PowerQuery RPC call deal with time-outs by
setting the power state to "error", whereas this only re-raises the
exception. Adding:

            addTask(node.update_power_state, POWER_STATE.ERROR)

to that except: block might be enough to fix this bug.
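
A minimal, self-contained sketch of that change follows. It is not the MAAS code: `FakeNode`, `timed_out`, the plain state strings and the `timeout=45` default are hypothetical placeholders, the built-in `TimeoutError` stands in for `crochet.TimeoutError`, the state update is called directly instead of going through `addTask`, and the success path (storing the returned state) is an assumption because the excerpt elides it.

```python
class PowerProblem(Exception):
    """Stand-in for the API error raised to the caller."""


class PowerActionFail(Exception):
    """Stand-in for a failure reported by the power driver."""


class FakeNode:
    """Hypothetical node that only records its power state."""

    def __init__(self, hostname, power_state="on"):
        self.hostname = hostname
        self.power_state = power_state

    def update_power_state(self, state):
        self.power_state = state


def query_power_state(node, wait, timeout=45):
    """Mirror the handler's except branches, with the timeout change applied."""
    try:
        state = wait(timeout)
    except TimeoutError:
        # Proposed change: record the error before raising.
        node.update_power_state("error")
        raise PowerProblem("Timed out waiting for power response")
    except PowerActionFail as e:
        node.update_power_state("error")
        raise PowerProblem(e)
    except NotImplementedError as e:
        node.update_power_state("unknown")
        raise PowerProblem(e)
    # Assumed success path: store whatever the query returned.
    node.update_power_state(state)
    return state


def timed_out(timeout):
    """Simulate a power query that never gets a response."""
    raise TimeoutError("no response within %s seconds" % timeout)


node = FakeNode("node-1")
try:
    query_power_state(node, timed_out)
except PowerProblem as error:
    print(node.power_state, "-", error)
```

With the extra update in the timeout branch, the node no longer keeps its previous "on" state after a timed-out query.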

Gavin Panella (allenap) wrote :

...
> setting the power state to "error", whereas this only re-raises the
                                                        ^^^

s/re-raises the/raises an/

Changed in maas:
assignee: nobody → Ricardo Bánffy (rbanffy)
Changed in maas:
status: Triaged → In Progress
Ricardo Bánffy (rbanffy) wrote :

Andres: can you add here the steps you took to make it fail and add relevant logs so we can try to figure out why Gavin's fix didn't work (because it really should)?

Changed in maas:
milestone: 1.9.0 → 1.9.1
Gavin Panella (allenap)
Changed in maas:
status: In Progress → Triaged
assignee: Ricardo Bánffy (rbanffy) → nobody
Gavin Panella (allenap)
Changed in maas:
status: Triaged → Incomplete
Changed in maas:
milestone: 1.9.1 → 1.9.2
Andres Rodriguez (andreserl) wrote :

09:50 < ivoks> roaksoax: so, here's how they have set it up
09:51 < ivoks> roaksoax: they have a laptop with wifi and ethernet. ethernet is bridged and MAAS is connected to that bridge. that network is
               used as management network, and they also access to ipmi over that network
09:51 < ivoks> roaksoax: another maas' interface is connected over standard libvirt default network
09:52 < ivoks> roaksoax: and when they unplug cable from laptop, all machines in maas show up as powered on
09:52 < ivoks> roaksoax: to me, this looks like a timeout/broken route

11:27 < ivoks> roaksoax: that's the first thing i checked
11:27 < ivoks> roaksoax: it takes 15-30 seconds, and returns ok
11:27 < ivoks> roaksoax: doing it with api/cli results in
11:27 < ivoks> and if it doesn't have access to ipmi, 'query-power-state' for the node returns 'success, powered on'

Newell Jensen (newell-jensen) wrote :

ivoks,

Some questions so that I can try to reproduce this (having trouble so far):

0. Is it possible to get a routes table for the routes that they have?
1. Is the customer testing all of this on the bare metal host or are they using VMs for everything (or a combination)?
2. Are they installing MAAS in a container, and instructing the container to use the bridge?
3. How is virbr0 being used?
4. What are they running IPMI against? qemu/kvm hosts? Other hardware?

Basically, a network diagram would be helpful to show exactly what is going on here. Also, if I could get access to the system, that would help a bunch.

Thanks :)

Ante Karamatić (ivoks) wrote :

0. No :/ It's a personal laptop.
1. MAAS is running in a VM on a personal laptop. All other machines are physical HP nodes (so, this is more about iLO rather than IPMI)
2. No, MAAS is in a VM (see comment 8 for details on how networking is done) on a laptop
3. virbr0 is a standard libvirt NATed network. That network is used as an external network on MAAS (eth0->virbr0-NAT>laptop's WIFI, eth1->br-mgmt0-bridge>laptop's eth0)
4. They are running iLO/IPMI against physical machines

I don't have network diagram, because this is all hearsay (by three different people with three different laptops/setups). I was planning on reproducing this on my own.

Newell Jensen (newell-jensen) wrote :

I was able to verify this, working on the fix/testing it now.

Changed in maas:
status: Incomplete → Triaged
Changed in maas:
assignee: nobody → Newell Jensen (newell-jensen)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released