[2.4] DHCPProbeService with authbind and uvloop causes eventloop to lock up

Bug #1768575 reported by Eric Vasquez
38
This bug affects 5 people
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Blake Rouse

Bug Description

During Foundations Engine deployment it looks like an interface was lost on a rack controller.
Afterwards a python Exception was thrown and the controller remained in degraded condition.

Traceback snippet:
2018-05-02 10:37:35 maasserver.rpc.regionservice: [info] Rack controller '6ww7g6' disconnected.
2018-05-02 10:37:35 RegionServer,0,::ffff:10.245.209.170: [info] RegionServer connection lost (HOST:IPv6Address(TCP, '::ffff:10.245.209.170', 5252
) PEER:IPv6Address(TCP, '::ffff:10.245.209.170', 45442))
2018-05-02 10:37:35 -: [critical] WSGI application error
        Traceback (most recent call last):
          File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 122, in callWithContext
            return self.currentContext().callWithContext(ctx, func, *args, **kw)
          File "/usr/lib/python3/dist-packages/twisted/python/context.py", line 87, in callWithContext
            self.contexts.pop()
          File "/usr/lib/python3/dist-packages/provisioningserver/utils/twisted.py", line 883, in callInContext
            return func(*args, **kwargs)
          File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 522, in run
            self.started = True
        --- <exception caught here> ---
          File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 500, in run
            self.write(elem)
          File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 455, in write
            self.reactor, wsgiWrite, self.started)
          File "/usr/lib/python3/dist-packages/twisted/internet/threads.py", line 122, in blockingCallFromThread
            result.raiseException()
          File "/usr/lib/python3/dist-packages/twisted/python/failure.py", line 385, in raiseException
            raise self.value.with_traceback(self.tb)
        builtins.AttributeError: 'NoneType' object has no attribute 'writeHeaders'

2018-05-02 10:37:35 regiond: [info] 127.0.0.1 POST /MAAS/metadata/2012-03-01/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)
2018-05-02 10:37:35 asyncio: [error] Exception in callback <function AsyncioSelectorReactor.callLater.<locals>.run at 0x7fa101c64400>
handle: <Handle AsyncioSelectorReactor.callLater.<locals>.run>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 47, in uvloop.loop.Handle._run
  File "/usr/lib/python3/dist-packages/twisted/internet/asyncioreactor.py", line 290, in run
    f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/twisted/web/wsgi.py", line 510, in wsgiError
    self.request.loseConnection()
  File "/usr/lib/python3/dist-packages/twisted/web/http.py", line 1474, in loseConnection
    self.channel.loseConnection()
AttributeError: 'NoneType' object has no attribute 'loseConnection'
2018-05-02 10:37:35 regiond: [info] 127.0.0.1 POST /MAAS/metadata/2012-03-01/ HTTP/1.1 --> 200 OK (referrer: -; agent: Python-urllib/3.6)

Related branches

Revision history for this message
Eric Vasquez (envas) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hey Eric,

PLease specify MAAS version and:

 - Does anything break if it is degraded ? Does it not fix itself afterwards ?

Changed in maas:
status: New → Incomplete
milestone: none → 2.4.0rc1
Revision history for this message
Eric Vasquez (envas) wrote :

Apologies

Version is: 2.4.0~beta3-6903-g0e8a9bc-0ubuntu1~18.04.1~20180502~ubuntu18.04.1

I know that foundations engine waits approx. 1200 seconds for all interfaces to be in ready state.
And that this one times out

Exception: Timed out waiting for racks to enter "running" state: {'eyc84a': 'running', '6ww7g6': 'degraded', 'm44cpe': 'running'}
Makefile:98: recipe for target 'infra' failed

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Eric,

Looking at the logs I cannot find anything obvious as everything seems connected. If you reproduce this issue with beta3 please attach the logs with the debug logging enabled.

Leaving this as incomplete in the meantime.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here's a reproduction, but one controller stayed in 'dead' this time instead of 'degraded'. This is on 2.4 beta 3 with debug logs enabled.

Changed in maas:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Another repro with a controller in 'degraded' state.

Changed in maas:
status: New → In Progress
assignee: nobody → Blake Rouse (blake-rouse)
tags: added: cdo-release-blocker
Revision history for this message
Andres Rodriguez (andreserl) wrote :

So I had a similar issue here where my controller would stay degraded. I upgraded to latest master (which has fixed a issue with debug logging of RPC), and after upgrading, I no longer see this issue.

As such, I'm thinking that this particular issue could be related to the following (LP: #1769929)

https://git.launchpad.net/maas/commit/?id=291c48e39729183dbeb7b50143067deb89b0092f

Revision history for this message
Andres Rodriguez (andreserl) wrote :

based on the stacktrace, it seems that probing for DHCP servers may be the culprit. This is basically causing rack controllers to lock up.

#0 0x00007eff8e19b23a in __waitpid (pid=22273, stat_loc=0x7ffd21461738, options=0)
    at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1 0x00007eff8e3a9295 in bind () from /usr/lib/authbind/libauthbind.so.1
#2 0x00000000005d4cfc in sock_bind (s=0x7eff872409a8, addro=<optimised out>) at ../Modules/socketmodule.c:2734
#3 0x00000000004c546d in _PyCFunction_FastCallDict (kwargs=0x0, nargs=1, args=0x3200198,
    func_obj=<built-in method bind of socket object at remote 0x7eff872409a8>) at ../Objects/methodobject.c:209
#4 _PyCFunction_FastCallKeywords (
    func=func@entry=<built-in method bind of socket object at remote 0x7eff872409a8>, stack=stack@entry=0x3200198,
    nargs=nargs@entry=1, kwnames=kwnames@entry=0x0) at ../Objects/methodobject.c:294
#5 0x000000000054ffe4 in call_function (pp_stack=pp_stack@entry=0x7ffd21461cb8, oparg=<optimised out>,
    kwnames=kwnames@entry=0x0) at ../Python/ceval.c:4824
#6 0x00000000005546cf in _PyEval_EvalFrameDefault (f=<optimised out>, throwflag=<optimised out>)
    at ../Python/ceval.c:3322
#7 0x000000000054fbe1 in PyEval_EvalFrameEx (throwflag=0,
    f=Frame 0x31fffe8, for file /usr/lib/python3/dist-packages/provisioningserver/dhcp/detect.py, line 283, in send_
dhcp_request_packet (request=<DHCPDiscoverPacket(mac_bytes=b'\x18\xa9\x05MB\xa6', mac_str='18:a9:05:4d:42:a6', secon
ds=4, transaction_id=b'\xf3\xb3\xfar') at remote 0x7eff80a24b38>, ifname='enp2s0f1', sock=<socket at remote 0x7eff87
2409a8>, mac='18:a9:05:4d:42:a6', bind_address='192.168.1.32')) at ../Python/ceval.c:753

Changed in maas:
importance: Undecided → Critical
summary: - Rack Controller stays in degraded condition on deployment
+ [2.4] DHCPProbeService with authbind and uvloop causes eventloop to lock
+ up
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.