Merge ~kissiel/checkbox-ng:multi-master into checkbox-ng:master

Proposed by Maciej Kisielewski
Status: Merged
Approved by: Sylvain Pineau
Approved revision: 7ac269310a27d0f49981ceb3c40d524dc055b7b0
Merged at revision: 40a6cb8d780f35627f9525fe4ea5e0b8073eb262
Proposed branch: ~kissiel/checkbox-ng:multi-master
Merge into: checkbox-ng:master
Diff against target: 116 lines (+66/-3)
2 files modified
checkbox_ng/launcher/master.py (+31/-3)
checkbox_ng/launcher/slave.py (+35/-0)
Reviewer Review Type Date Requested Status
Sylvain Pineau (community) Approve
Maciej Kisielewski (community) Needs Resubmitting
Review via email: mp+378497@code.launchpad.net

Description of the change

Fix a problem where two masters could be connected to the same slave, causing lots of race conditions.

To post a comment you must log in.
Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :
Download full text (3.1 KiB)

Got the following tracebacks when the second master tries to connect:

$ checkbox-cli master 127.0.0.1 ./mylauncher
Connection lost!
connection closed by peer
Connection lost!
connection closed by peer

and this one on slave:

$ sudo checkbox-cli slave
client connection terminated abruptly
Traceback (most recent call last):
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/utils/server.py", line 181, in _authenticate_and_serve_client
    self._serve_client(sock2, credentials)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/utils/server.py", line 202, in _serve_client
    conn = self.service._connect(Channel(SocketStream(sock)), config)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/service.py", line 100, in _connect
    self.on_connect(conn)
  File "/tmp/checkbox-ng/checkbox_ng/launcher/slave.py", line 60, in on_connect
    if SessionAssistantSlave.master_blaster:
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/netref.py", line 220, in method
    return syncreq(_self, consts.HANDLE_CALLATTR, name, args, kwargs)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/netref.py", line 75, in syncreq
    return conn.sync_request(handler, proxy, *args)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/protocol.py", line 471, in sync_request
    return self.async_request(handler, *args, timeout=timeout).value
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/async_.py", line 95, in value
    self.wait()
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/async_.py", line 47, in wait
    raise AsyncResultTimeout("result expired")
TimeoutError: result expired
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/utils/server.py", line 181, in _authenticate_and_serve_client
    self._serve_client(sock2, credentials)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/utils/server.py", line 202, in _serve_client
    conn = self.service._connect(Channel(SocketStream(sock)), config)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/service.py", line 100, in _connect
    self.on_connect(conn)
  File "/tmp/checkbox-ng/checkbox_ng/launcher/slave.py", line 60, in on_connect
    if SessionAssistantSlave.master_blaster:
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/netref.py", line 220, in method
    return syncreq(_self, consts.HANDLE_CALLATTR, name, args, kwargs)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/netref.py", line 75, in syncreq
    return conn.sync_request(handler, proxy, *args)
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/protocol.py", line 471, in sync_request
    return self.async_request(handler, *args, timeout=timeout).value
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/async_.py", line 95, in value
    self.wait()
  File "/tmp/checkbox-ng/plainbox/vendor/rpyc/core/async_.py", line 47, in wait
    raise AsyncResultTimeout("result expired")
TimeoutError: result expired

To be precise it happens when I try to connect the second master while still being at the test selection urwid screen...

Read more...

review: Needs Fixing
Revision history for this message
Maciej Kisielewski (kissiel) wrote :

TYVM for catching that. The crash(es) were caused by urwid blocking the master thread, thereby blocking the way for the slave to run the remote kill.

I added code that catches that timeout and just disconnects the master in a less informative way (no information about the new master), but at least it works :D

I feel we cannot do a better thing without rewriting a lot of things.

review: Needs Resubmitting
Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :

Thanks, +1

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/checkbox_ng/launcher/master.py b/checkbox_ng/launcher/master.py
2index 1565b4a..f110860 100644
3--- a/checkbox_ng/launcher/master.py
4+++ b/checkbox_ng/launcher/master.py
5@@ -19,6 +19,7 @@
6 This module contains implementation of the master end of the remote execution
7 functionality.
8 """
9+import contextlib
10 import getpass
11 import gettext
12 import ipaddress
13@@ -166,6 +167,7 @@ class RemoteMaster(ReportsStage, MainLoopStage):
14 config['allow_all_attrs'] = True
15 config['sync_request_timeout'] = 120
16 keep_running = False
17+ server_msg = None
18 self._prepare_transports()
19 interrupted = False
20 while True:
21@@ -181,6 +183,18 @@ class RemoteMaster(ReportsStage, MainLoopStage):
22 break
23 conn = rpyc.connect(host, port, config=config)
24 keep_running = True
25+ def quitter(msg):
26+ # this will be called when the slave decides to disconnect
27+ # this master
28+ nonlocal server_msg
29+ nonlocal keep_running
30+ keep_running = False
31+ server_msg = msg
32+ with contextlib.suppress(AttributeError):
33+ # TODO: REMOTE_API
34+ # when bumping the remote api make this bit obligatory
35+ # i.e. remove the suppressing
36+ conn.root.register_master_blaster(quitter)
37 self._sa = conn.root.get_sa()
38 self.sa.conn = conn
39 if not self._sudo_provider:
40@@ -213,9 +227,23 @@ class RemoteMaster(ReportsStage, MainLoopStage):
41 self.resume_interacting, interaction=payload),
42 }[state]()
43 except EOFError as exc:
44- print("Connection lost!")
45- _logger.info("master: Connection lost due to: %s", exc)
46- time.sleep(1)
47+ if keep_running:
48+ print("Connection lost!")
49+ # this is yucky but it works, in case of explicit
50+ # connection closing by the slave we get this msg
51+ _logger.info("master: Connection lost due to: %s", exc)
52+ if str(exc) == 'stream has been closed':
53+ print('Slave explicitly disconnected you. Possible '
54+ 'reason: new master connected to the slave')
55+ break
56+ print(exc)
57+ time.sleep(1)
58+ else:
59+ # if keep_running got set to False it means that the
60+ # network interruption was planned, AKA slave disconnected
61+ # this master
62+ print(server_msg)
63+ break
64 except (ConnectionRefusedError, socket.timeout, OSError) as exc:
65 _logger.info("master: Connection lost due to: %s", exc)
66 if not keep_running:
67diff --git a/checkbox_ng/launcher/slave.py b/checkbox_ng/launcher/slave.py
68index b568620..3eed172 100644
69--- a/checkbox_ng/launcher/slave.py
70+++ b/checkbox_ng/launcher/slave.py
71@@ -39,10 +39,45 @@ _logger = logging.getLogger("slave")
72 class SessionAssistantSlave(rpyc.Service):
73
74 session_assistant = None
75+ controlling_master_conn = None
76+ master_blaster = None
77
78 def exposed_get_sa(*args):
79 return SessionAssistantSlave.session_assistant
80
81+ def exposed_register_master_blaster(self, callable):
82+ """
83+ Register a callable that will be called when the slave decides to
84+ disconnect the master. This should be used to prepare the master for
85+ the disconnection, so it can differentiate between network failures
86+ and a planned disconnect.
87+ The callable will be called with one param - a string with a reason
88+ for the disconnect.
89+ """
90+ SessionAssistantSlave.master_blaster = callable
91+
92+ def on_connect(self, conn):
93+ try:
94+ if SessionAssistantSlave.master_blaster:
95+ msg = 'Forcefully disconnected by new master from {}:{}'.format(
96+ conn._config['endpoints'][1][0], conn._config['endpoints'][1][1])
97+ SessionAssistantSlave.master_blaster(msg)
98+ old_master = SessionAssistantSlave.controlling_master_conn
99+ if old_master is not None:
100+ old_master.close()
101+ SessionAssistantSlave.master_blaster = None
102+
103+ SessionAssistantSlave.controlling_master_conn = conn
104+ except TimeoutError as exc:
105+ # this happens when the reference to .master_blaster times out,
106+ # meaning the master is blocked on an urwid screen or some other
107+ # thread blocking operation. In any case it means there was a
108+ # previous master, so we need to kill it
109+ old_master = SessionAssistantSlave.controlling_master_conn
110+ SessionAssistantSlave.master_blaster = None
111+ old_master.close()
112+ SessionAssistantSlave.controlling_master_conn = conn
113+
114
115 class RemoteSlave():
116 """

Subscribers

People subscribed via source and target branches