If start_nodes() fails, it doesn't clean up after itself.

Bug #1330765 reported by Julian Edwards
This bug affects 1 person
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Graham Binns

Bug Description

This method needs to get better at deciding whether it's atomic or not. Currently, if it receives a StaticIPAddressExhaustion error it just bails out, potentially leaving behind addresses that were already allocated for previous nodes.
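
The failure mode described above can be reproduced with a toy model. This is only a sketch: `Pool`, `claim` and `start_nodes_leaky` are hypothetical stand-ins, not MAAS code, and `StaticIPAddressExhaustion` is redefined locally so the example runs on its own.

```python
class StaticIPAddressExhaustion(Exception):
    """Local stand-in for the MAAS error of the same name."""


class Pool:
    """Hypothetical static IP pool holding a fixed set of addresses."""

    def __init__(self, addresses):
        self.free = list(addresses)
        self.allocated = {}  # node name -> address

    def claim(self, node):
        if not self.free:
            raise StaticIPAddressExhaustion(node)
        self.allocated[node] = self.free.pop(0)
        return self.allocated[node]


def start_nodes_leaky(pool, nodes):
    """Mimics the reported behaviour: bail out on the first error."""
    for node in nodes:
        # If this raises partway through, earlier claims are never released.
        pool.claim(node)


pool = Pool(["10.0.0.1", "10.0.0.2"])
try:
    start_nodes_leaky(pool, ["node-a", "node-b", "node-c"])
except StaticIPAddressExhaustion:
    pass

# node-a and node-b still hold addresses although the call as a whole failed.
leaked = dict(pool.allocated)
```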

Tags: robustness

Changed in maas:
status: New → Triaged
importance: Undecided → High
tags: added: robustness
Raphaël Badin (rvb)
Changed in maas:
milestone: none → 1.7.0
Changed in maas:
milestone: 1.7.0 → next
Christian Reis (kiko) wrote :

Agree with moving to next if this is just handling Static IP exhaustion (which is unfortunate but not a showstopper).

Graham Binns (gmb) wrote :

Weirdly, I don't think start_nodes() (or stop_nodes()) can be properly atomic, because they handle multiple nodes, and any one of those nodes could fail… etc. That said, *nowhere* is start_nodes() called with more than one node, because reasons. So the callsites *are* atomic, which means that we could at least add explicit atomicity there and so have nice-ish rollbacks when things blow up for individual nodes.
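
One way to picture "explicit atomicity" at a single-node call site is to register an undo action after each step and run the undo actions only if a later step fails. The sketch below uses the standard library's `contextlib.ExitStack` for that. It is an illustration under assumptions, not the approach taken in MAAS: `start_single_node_atomically`, the step names and `fail_power_on` are all hypothetical, and a real fix might lean on database transactions instead.

```python
from contextlib import ExitStack


def start_single_node_atomically(node, steps):
    """Run (do, undo) pairs for one node; undo completed steps on failure."""
    with ExitStack() as rollback:
        for do, undo in steps:
            do(node)
            # Only registered once `do` succeeded, so only completed
            # steps are ever undone.
            rollback.callback(undo, node)
        # Success: detach the undo callbacks so they are not run on exit.
        rollback.pop_all()


log = []


def fail_power_on(node):
    raise RuntimeError("could not power on %s" % node)


# Hypothetical steps; each appends to `log` instead of touching real systems.
steps = [
    (lambda n: log.append(("claim_ip", n)),
     lambda n: log.append(("release_ip", n))),
    (lambda n: log.append(("add_hostmap", n)),
     lambda n: log.append(("remove_hostmap", n))),
    (fail_power_on,
     lambda n: log.append(("power_off", n))),
]

try:
    start_single_node_atomically("node-a", steps)
except RuntimeError:
    pass
```

After the failed call, `log` records the two completed steps followed by their undo actions in reverse order, and the original error still propagates to the caller.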

Graham Binns (gmb)
Changed in maas:
status: Triaged → In Progress
assignee: nobody → Graham Binns (gmb)
Graham Binns (gmb) wrote :

We're now getting to the point where this can be a serious problem so I'm bumping the importance up. I'm already working on a fix, though, so that's mostly for the sake of making everyone else aware of what's going on.

Changed in maas:
importance: High → Critical
Graham Binns (gmb) wrote :

After discussing this in the MAAS core team meeting, we came to the conclusion that since start_nodes() and stop_nodes() are never ever used with > 1 node, we should refactor them to handle *only* one node. That makes it much easier to be robust in the face of failure, because we're not having to revert parts of what's being done in those methods should something go wrong.
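
As a rough sketch of why a single-node method is easier to make robust: the cleanup only ever has to cover the one node in hand, and a caller that wants several nodes can loop and treat each failure independently. The names below (`start_node`, `start_each`, and the `claim_ip`, `power_on`, `release_ip` helpers) are hypothetical and do not reflect the actual refactored MAAS API.

```python
from itertools import count


def start_node(node, claim_ip, power_on, release_ip):
    """Start exactly one node, releasing its address if power-on fails."""
    ip = claim_ip(node)
    try:
        power_on(node, ip)
    except Exception:
        release_ip(node, ip)
        raise
    return ip


def start_each(nodes, claim_ip, power_on, release_ip):
    """Caller-side loop: one node failing does not affect the others."""
    started, failed = {}, {}
    for node in nodes:
        try:
            started[node] = start_node(node, claim_ip, power_on, release_ip)
        except Exception as error:
            failed[node] = error
    return started, failed


# Hypothetical in-memory helpers used to exercise the sketch.
claimed = {}
_next_host = count(1)


def claim_ip(node):
    claimed[node] = "10.0.0.%d" % next(_next_host)
    return claimed[node]


def release_ip(node, ip):
    del claimed[node]


def power_on(node, ip):
    if node == "node-b":
        raise RuntimeError("power failure")


started, failed = start_each(
    ["node-a", "node-b", "node-c"], claim_ip, power_on, release_ip)
```

Here `node-b` fails to power on, its address is released, and the other two nodes keep theirs.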

Changed in maas:
status: In Progress → Fix Committed
Christian Reis (kiko) wrote :

See also bug 1384926.

Graham Binns (gmb) wrote :

As I commented in bug 1384926, this bug and that one are not related:

 - This bug is about the fact that, if start_nodes() fails, it can leave *static* IPs, hostmaps and DNS records lying around that should have been cleaned up.
 - Bug 1384926 is about the fact that when MAAS runs out or is close to running out of *dynamic* IPs, it doesn't warn anyone, and DHCP just starts failing silently.

Christian Reis (kiko)
Changed in maas:
milestone: next → 1.7.1
Changed in maas:
status: Fix Committed → Fix Released