Refresh scripts are not re-run if they pass, but fail to report the results to the region

Bug #2003310 reported by Marian Gasparovic
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
High
Björn Tillenius
3.3
Fix Released
High
Björn Tillenius

Bug Description

Same symptoms as old bug https://bugs.launchpad.net/maas/+bug/1847794

2023-01-19-02:36:26 root DEBUG [localhost]: maas root vm-host compose 3 hostname=vault-3 cores=2 memory=4096 storage=40.0 zone=4 'interfaces=eth0:space=oam-space;eth1:space=internal-space'
2023-01-19-02:36:31 root ERROR [localhost] Command failed: maas root vm-host compose 3 hostname=vault-3 cores=2 memory=4096 storage=40.0 zone=4 'interfaces=eth0:space=oam-space;eth1:space=internal-space'
2023-01-19-02:36:31 root ERROR [localhost] STDOUT follows:
Pod must be on a known host if interfaces are specified.

Logs: https://oil-jenkins.canonical.com/artifacts/6d1584f1-626d-4682-8844-a8d76249cb7e/generated/generated/maas/logs-2023-01-19-02.37.26.tgz

Related branches

Changed in maas:
milestone: none → 3.3.0
summary: - [3.3-rc2] Pod must be on a known host if interfaces are specified.
+ Pod must be on a known host if interfaces are specified.
Changed in maas:
milestone: 3.3.0 → 3.4.0
importance: Undecided → High
status: New → Triaged
Revision history for this message
Björn Tillenius (bjornt) wrote : Re: Pod must be on a known host if interfaces are specified.

From the log files, this looks to be the culprit: https://pastebin.canonical.com/p/3rrCTTbJxs/

Revision history for this message
Björn Tillenius (bjornt) wrote :

It seems like the machine-resources binary is failing for some reason. But 20-maas-03-machine-resources swallows the error, which makes MAAS fail in the annotation stage, at which the results are thrown away. So we can't see what error machine-resources is reporting.

I'm going to fix 20-maas-03-machine-resources so that it doesn't swallow the error.

In the meantime, if you could run this on machines where you see this error, it would
be helpful for this bug:

  $ sudo snap run --shell maas.supervisor
  $ /snap/maas/current/usr/share/maas/machine-resources/amd64

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Adam Collard (adam-collard) wrote :

> $ sudo snap run --shell maas.supervisor
> $ /snap/maas/current/usr/share/maas/machine-resources/amd64

Note in the second line of https://bugs.launchpad.net/maas/+bug/2003310/comments/2 you should pick the architecture that matches your host

Revision history for this message
Marian Gasparovic (marosg) wrote :

Here is the output, it is 3800 lines

Revision history for this message
Marian Gasparovic (marosg) wrote :

Actual command that failed when above output was collected

2023-01-23-16:35:48 foundationcloudengine.layers.configuremaas INFO Creating vault-1 in sqa-lab2-infra-1-arm
2023-01-23-16:35:48 root DEBUG [localhost]: maas root vm-host compose 1 hostname=vault-1 cores=2 memory=4096 storage=40.0 zone=2 'interfaces=eth0:space=oam-space;eth1:space=internal-space'
2023-01-23-16:35:52 root ERROR [localhost] Command failed: maas root vm-host compose 1 hostname=vault-1 cores=2 memory=4096 storage=40.0 zone=2 'interfaces=eth0:space=oam-space;eth1:space=internal-space'
2023-01-23-16:35:52 root ERROR [localhost] STDOUT follows:
Pod must be on a known host if interfaces are specified.

Revision history for this message
Björn Tillenius (bjornt) wrote :

Thanks Marian. Unfortunately, that command didn't fail, as I expected it to. Do you have any failed runs with 3.3.0-rc3? There we should have some more debug information to go on.

Changed in maas:
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Jeffrey Chang (modern911) wrote :
Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :
Revision history for this message
Marian Gasparovic (marosg) wrote :
Changed in maas:
status: Incomplete → New
Revision history for this message
Björn Tillenius (bjornt) wrote :

I took a look at a live environment where this bug occurred. 20-maas-03-machine-resources had PASSED and had the expected output. 50-maas-01-commissioning was PENDING.

The fact that 50-maas-01-commissioning was PENDING should mean that it in fact was PASSED, but then due to this bug the transaction was aborted and the status reverted back to PENDING.

50-maas-01-commissioning pretty much opens up a file and parse it as json, and then dumps it again. So it should mean that it produced valid json. But then when we try to annotate the data, we get None instead.

I'm still not sure what's going on there. I restart of MAAS fixed the problem and every commissioning script succeeded.

At this point, my best guess would be an old twisted bug we saw in Landscape, where in rare occasions after executing a subprocess, reading from stdout would result in None, instead of the real stdout.

Changed in maas:
status: New → Triaged
Revision history for this message
Björn Tillenius (bjornt) wrote :

Some observerations looking at https://oil-jenkins.canonical.com/artifacts/bd827ab3-7242-4f94-a1d2-cad7a93bbcb8/generated/generated/maas/logs-2023-03-08-01.58.05.tgz

leafeon and meinfoo seems fine, they have all scripts passing. swoobat have all scripts passing, except for 50-maas-01-commissioning and maas-list-modaliases, which are pending: https://pastebin.canonical.com/p/33x5WJfT2T/

Again, we see failures to update network interface configuration. But for leafeon, it seems like we eventually succeed: https://pastebin.canonical.com/p/YfDMJyyKpS/

For swoobat we see similar failures, but the script never gets marked as passed, as with leafeon: https://pastebin.canonical.com/p/XkVVZhwypb/

The error about failing to update network interfaces comes from NetworksMonitoringService.do_action():

       try:
            interfaces = yield maybeDeferred(self.getInterfaces)
            yield self._updateInterfaces(interfaces)
        except BaseException as e:
            msg = (
                "Failed to update and/or record network interface "
                "configuration: %s; interfaces: %r" % (e, interfaces)
            )
            log.err(None, msg)

so it's actually not clear where the error is actually happening.

Revision history for this message
Björn Tillenius (bjornt) wrote :

Seems to be some error happening while reporting the results of the refresh scripts:

  https://pastebin.canonical.com/p/v98cqnzmb6/

In addition to this, the refresh() code doesn't notice these errors. They get swallowed by signal_wrapper, and we continue with the refresh even though something went wrong.

Revision history for this message
Björn Tillenius (bjornt) wrote :

I've created a snap branch that prints out some more debug logs, which hopefully will pin-point where the error is.

The branch for the maas snap is 3.3/edge/bug-2003310-debug

It would be great to try to reproduce the error using that snap and take a look at the logs.

Revision history for this message
Marian Gasparovic (marosg) wrote :
Revision history for this message
Björn Tillenius (bjornt) wrote :

Thanks!

Those logs confirms my suspicion. The problem is indeed that MAAS is swallowing the exception in signal_wrapper().

What happpens is that maas_url is set to a VIP, and it probably either gets moved around and/or pointing to a MAAS that's not ready yet.

So when MAAS runs the refresh scripts, there's first a period where most fail, since 20-maas-03-machine-resources can't download the resource binary.

But after that, all the refresh scripts pass. However, when reporting that the scripts have passed via signal_wrapper, sometimes it fails doing so. Now there is a situation where the controller thinks all scripts have passed, but it hasn't been recorded in the database. And they won't run again, unless the host interface configuration changes.

We need to act on errors in signal_wrapper and not consider the script passed in that case. The scripts need to be retried until they all pass.

Changed in maas:
assignee: nobody → Björn Tillenius (bjornt)
status: Triaged → In Progress
summary: - Pod must be on a known host if interfaces are specified.
+ Refresh scripts are not re-run if they pass, but fail to report the
+ results to the region
Changed in maas:
status: In Progress → Fix Committed
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta1
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.