omshell takes 100%cpu

Bug #1904425 reported by Björn Tillenius
24
This bug affects 3 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Alberto Donato
2.9
Fix Released
Undecided
Unassigned

Bug Description

This is with 2.9.0rc1 installed as a snap.

While composing VMs, I suddenly couldn't compose any more. It took a very long time spinning om "Composing..." and then finally returning:

  Error:Pod unable to compose machine:

On the controller page everything looks green, but in rackd.log I see this:

2020-11-16 16:03:33 provisioningserver.rpc.clusterservice: [critical] Failed to contact region. (While requesting RPC info at http://10.245.136.7:5240/MAAS).
        Traceback (most recent call last):
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 460, in callback
            self._startRunCallbacks(result)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
        --- <exception caught here> ---
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1367, in _doUpdate
            eventloops, maas_url = yield self._get_rpc_info(urls)
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
            raise config_exc
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1576, in handle_responses
            errors[0].raiseException()
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1537, in _serial_fetch_rpc_info
            raise last_exc
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1631, in _get_rpc_info
            raise config_exc
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1602, in _get_rpc_info
            eventloops, maas_url = yield self._parallel_fetch_rpc_info(urls)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1576, in handle_responses
            errors[0].raiseException()
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1537, in _serial_fetch_rpc_info
            raise last_exc
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/clusterservice.py", line 1529, in _serial_fetch_rpc_info
            response = yield self._fetch_rpc_info(url, orig_url)
        twisted.internet.error.DNSLookupError: DNS lookup failed: Couldn't find the hostname '10.245.136.7'.

2020-11-16 16:03:38 provisioningserver.rpc.clusterservice: [critical] DHCPv6 configure timed out
        Traceback (most recent call last):
        --- <exception caught here> ---
          File "/snap/maas/10507/lib/python3.8/site-packages/provisioningserver/rpc/dhcp.py", line 348, in configure
            yield deferToThread(_delete_config, server)
        twisted.internet.defer.CancelledError:

Related branches

Revision history for this message
Björn Tillenius (bjornt) wrote :
Revision history for this message
Björn Tillenius (bjornt) wrote :
Revision history for this message
Björn Tillenius (bjornt) wrote :
Changed in maas:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.9.0rc2
Revision history for this message
Björn Tillenius (bjornt) wrote :

Actually, the DNS lookup failure might be a red herring (although, still a problem), since I now see that it has been showing up in the logs a few weeks now.

The pod errors don't give any clue what's wrong, but I seem to have lost the ability to compose vms on any pod now (after being able to compose 6 before that):

2020-11-16 16:16:42 maasserver.websockets.handlers.pod: [critical] Failed to compose machine.
        Traceback (most recent call last):
        --- <exception caught here> ---
          File "/snap/maas/10507/lib/python3.8/site-packages/maasserver/forms/pods.py", line 825, in async_compose_machine
            result = yield compose_machine(
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/protocols/amp.py", line 1994, in _massageError
            error.trap(RemoteAmpError)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/python/failure.py", line 439, in trap
            self.raiseException()
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/python/failure.py", line 467, in raiseException
            raise self.value.with_traceback(self.tb)
        twisted.internet.defer.CancelledError:

2020-11-16 16:16:42 maasserver.websockets.protocol: [critical] Error on request (391) pod.compose: Pod unable to compose machine:
        Traceback (most recent call last):
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1475, in gotResult
            _inlineCallbacks(r, g, status)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 1464, in _inlineCallbacks
            status.deferred.errback()
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 501, in errback
            self._startRunCallbacks(fail)
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
            self._runCallbacks()
        --- <exception caught here> ---
          File "/snap/maas/10507/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
            current.result = callback(current.result, *args, **kw)
          File "/snap/maas/10507/lib/python3.8/site-packages/maasserver/websockets/handlers/pod.py", line 403, in wrap_errors
            raise PodProblem(
        maasserver.exceptions.PodProblem: Pod unable to compose machine:

Revision history for this message
Björn Tillenius (bjornt) wrote :

Looking at the process list, I see 10 omshell processes, each using 100% CPU.

Revision history for this message
Alberto Donato (ack) wrote :

I can reproduce this quiten often with just a plain omshell by creating a file with

server 127.0.0.1
port 7911
key omapi_key <omapi key>
connect

and calling "omshell < file"

Most of the time the call just fails or segfaults, but sometimes it gets stuck when it should exit and just spins at 100% CPU

A strace on the process shows it keep select()'ing on the connection socket, after the server has closed the connection.

This seems to point at an omshell bug.

Alberto Donato (ack)
Changed in maas:
assignee: nobody → Alberto Donato (ack)
importance: High → Critical
status: Triaged → In Progress
Alberto Donato (ack)
summary: - Failed to contact region. (While requesting RPC info)
+ omshell takes 100%cpu
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
milestone: 2.9.0rc2 → 2.10-next
Changed in maas:
milestone: 3.0.0 → none
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.