machine deploy install_kvm=True fails

Bug #2009805 reported by Marian Gasparovic
This bug affects 2 people
Affects | Status       | Importance | Assigned to        | Milestone
MAAS    | Fix Released | High       | Alberto Donato     |
3.2     | Fix Released | High       | Jack Lloyd-Walters |
3.3     | Fix Released | High       | Björn Tillenius    |

Bug Description

`machine deploy 63rsbp install_kvm=True` did not finish in 30 minutes (usually it takes less than 15)

We did not test this in 3.3.0; it works fine with 3.2.

Logs - https://oil-jenkins.canonical.com/artifacts/fb3810d0-acaf-4a24-bc69-4f0afc1f24b4/generated/generated/maas/logs-2023-03-08-12.22.13.tgz


tags: added: cdo-qa foundations-engine
removed: cdo-
description: updated
Revision history for this message
Marian Gasparovic (marosg) wrote :

It also fails with 3.3/stable.

Revision history for this message
Björn Tillenius (bjornt) wrote :

I can reproduce this locally. With 3.3.1, trying to deploy 20.04 as a KVM (LXD) host, it's now been almost ten minutes since the machine rebooted and reported "Node installation - 'cloudinit' running config-final-message with frequency always". The machine is still Deploying (Rebooting) and no KVM host has been registered.

Changed in maas:
milestone: none → 3.4.0
status: New → In Progress
assignee: nobody → Björn Tillenius (bjornt)
importance: Undecided → High
Revision history for this message
Björn Tillenius (bjornt) wrote :

In the solqa runs I see this:

  maasserver.exceptions.PodProblem: Failed talking to pod: Failed to login to virsh console.

In my local run I see this:

  maasserver.exceptions.PodProblem: Failed talking to pod: Pod 3: Failed to connect to the LXD REST API: HTTPSConnectionPool(host='10.5.43.255'...

At least for me locally, it chooses the wrong IP somehow. The machine has only one IP, 10.5.32.6. I suspect it's the same for the solqa runs, even though the virsh error message doesn't reveal the IP.

Revision history for this message
Björn Tillenius (bjornt) wrote :

The problem is in _get_ip_address_for_vmhost(). It gets every IP associated with the boot interface and any of its parents, and then returns one of them.

We need to check what kind of IPs they are, so that we return one that MAAS configured the machine to have. What happens now is that sometimes the right IP is returned, and sometimes it's the IP the machine used while commissioning.
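A minimal sketch of the kind of filtering described above: prefer addresses MAAS itself configured (e.g. sticky/auto allocations) over addresses merely observed while commissioning. The names (`StaticIP`, `pick_vmhost_ip`, the allocation-type strings) are illustrative assumptions, not MAAS's actual data model.

```python
from dataclasses import dataclass

# Hypothetical allocation types: "sticky"/"auto" are configured by MAAS,
# "discovered" is an address seen on the wire during commissioning.
STICKY, AUTO, DISCOVERED = "sticky", "auto", "discovered"

@dataclass
class StaticIP:
    ip: str
    alloc_type: str

def pick_vmhost_ip(ips):
    """Return a MAAS-configured IP if one exists, else fall back to any IP."""
    configured = [s.ip for s in ips if s.alloc_type in (STICKY, AUTO)]
    if configured:
        return configured[0]
    return ips[0].ip if ips else None

ips = [
    StaticIP("10.5.43.255", DISCOVERED),  # stale commissioning address
    StaticIP("10.5.32.6", AUTO),          # address MAAS configured for deploy
]
print(pick_vmhost_ip(ips))  # -> 10.5.32.6
```

Without the allocation-type check, picking the first address in the list would return the stale commissioning IP, matching the symptom above.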

FWIW, deploying a second time made it work for me. It's only when I deploy right after commissioning the machine that I see this problem.

Revision history for this message
Björn Tillenius (bjornt) wrote :

Turns out that this bug was already fixed in master, 91442ef7aecbcd2caf345b8abad579876f5d9448.

I'm backporting the fix now.

summary: - [3.3.1 rc] machine deploy install_kvm=True fails
+ machine deploy install_kvm=True fails
Changed in maas:
status: In Progress → Fix Committed
assignee: Björn Tillenius (bjornt) → Alberto Donato (ack)
Revision history for this message
Chris Johnston (cjohnston) wrote :

I am seeing this error on:

installed: 3.2.7-12037-g.c688dd446 (26274) 148MB -

2023-03-17 19:10:57 metadataserver.api_twisted: [critical] Failed to process status message instantly.
        Traceback (most recent call last):
          File "/usr/lib/python3.8/threading.py", line 870, in run
            self._target(*self._args, **self._kwargs)
          File "/snap/maas/26274/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 821, in worker
            return target()
          File "/snap/maas/26274/usr/lib/python3/dist-packages/twisted/_threads/_threadworker.py", line 46, in work
            task()
          File "/snap/maas/26274/usr/lib/python3/dist-packages/twisted/_threads/_team.py", line 190, in doWork
            task()
        --- <exception caught here> ---
          File "/snap/maas/26274/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 250, in inContext
            result = inContext.theWork()
          File "/snap/maas/26274/usr/lib/python3/dist-packages/twisted/python/threadpool.py", line 266, in <lambda>
            inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
          File "/snap/maas/26274/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/snap/maas/26274/usr/lib/python3/dist-packages/twisted/python/context.py", line 85, in callWithContext
            return func(*args,**kw)
          File "/snap/maas/26274/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 856, in callInContext
            return func(*args, **kwargs)
          File "/snap/maas/26274/lib/python3.8/site-packages/provisioningserver/utils/twisted.py", line 202, in wrapper
            result = func(*args, **kwargs)
          File "/snap/maas/26274/lib/python3.8/site-packages/metadataserver/api_twisted.py", line 585, in _processMessageNow
            self._processMessage(node, message)
          File "/snap/maas/26274/lib/python3.8/site-packages/maasserver/utils/orm.py", line 756, in call_within_transaction
            return func_outside_txn(*args, **kwargs)
          File "/snap/maas/26274/lib/python3.8/site-packages/maasserver/utils/orm.py", line 559, in retrier
            return func(*args, **kwargs)
          File "/usr/lib/python3.8/contextlib.py", line 75, in inner
            return func(*args, **kwds)
          File "/snap/maas/26274/lib/python3.8/site-packages/metadataserver/api_twisted.py", line 493, in _processMessage
            _create_vmhost_for_deployment(node)
          File "/snap/maas/26274/lib/python3.8/site-packages/metadataserver/api_twisted.py", line 262, in _create_vmhost_for_deployment
            discover_and_sync_vmhost(pod, node.owner)
          File "/snap/maas/26274/lib/python3.8/site-packages/maasserver/vmhost.py", line 77, in discover_and_sync_vmhost
            raise PodProblem(str(error))
        maasserver.exceptions.PodProblem: Failed talking to pod: Failed to login to virsh console.

Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta1
Revision history for this message
Luca Cervigni (cervigni) wrote :

We seem to be affected by this too, running MAAS 3.2.7.
Can this be backported to 3.2 as well?

Would a downgrade to 3.2.6 for rackd alone fix the issue?

Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released