Merge lp:~allenap/launchpad/ec2-parry into lp:launchpad

Proposed by Gavin Panella
Status: Rejected
Rejected by: Gavin Panella
Proposed branch: lp:~allenap/launchpad/ec2-parry
Merge into: lp:launchpad
Diff against target: 1371 lines (+1043/-96)
8 files modified
lib/devscripts/ec2test/account.py (+3/-0)
lib/devscripts/ec2test/builtins.py (+100/-0)
lib/devscripts/ec2test/instance.py (+11/-2)
lib/devscripts/ec2test/remote.py (+67/-54)
lib/devscripts/ec2test/remotenode.py (+539/-0)
lib/devscripts/ec2test/remotenodekiller.py (+146/-0)
lib/devscripts/ec2test/testrunner.py (+88/-40)
utilities/pb-shell (+89/-0)
To merge this branch: bzr merge lp:~allenap/launchpad/ec2-parry
Reviewer Review Type Date Requested Status
Michael Hudson-Doyle Abstain
Review via email: mp+14693@code.launchpad.net
Revision history for this message
Gavin Panella (allenap) wrote :

== The new Twisted node daemon, lib/devscripts/ec2test/remotenode.py ==

Diff: http://pastebin.ubuntu.com/315123/

This daemon is meant to make the following possible:

1. Start up N instances (on EC2, but the daemon makes no assumptions
   on that front).

2. One of them is told it is the supervisor. It immediately builds the
   LP tree and calculates the complete list of tests to run. It puts
   these tests into N similarly sized parcels for workers to pick up
   later. It also starts the subunit results aggregator service.

3. Every instance is told where the supervisor is. These instances
   become workers, and ask the supervisor for work.

4. Each worker runs the tests it has been given, and streams the
   results back to the supervisor's result aggregator.

5. When a worker has finished its run, with or without error, it tells
   the supervisor that it has attempted to run every test in its
   parcel.

6. The supervisor keeps track of which tests it actually has results
   for; there seem to be some problems where workers don't actually
   run all of the tests they're asked to.

7. The worker asks for more work. If there isn't any, it shuts
   down.

8. Once the supervisor has responses for all parcels, it sends a
   comprehensive report email, then shuts down.

9. Another process (remotenodekiller.py) takes care of actually
   shutting each instance down, when either the node daemon has exited
   or the logs have shown no activity for 10 minutes.
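
For orientation, here is a minimal sketch (not part of the branch) of
the worker side of steps 3 to 7, using the Perspective Broker method
names that remotenode.py exposes (get_work, done_work). run_parcel is
a stand-in for actually invoking bin/test with --subunit.

    from twisted.internet import defer, reactor
    from twisted.spread import pb

    TEST_NODE_PORT = 8789  # the PB port that nodes listen on


    def run_parcel(tests):
        """Stand-in for running one parcel of tests with --subunit."""
        print 'Would run %d tests.' % len(tests)
        return defer.succeed(None)


    @defer.inlineCallbacks
    def work_loop(supervisor):
        # Ask for the first parcel; after each run, done_work reports
        # the attempt and hands back the next parcel, or None when
        # nothing is left, at which point the worker can shut down.
        parcel = yield supervisor.callRemote('get_work')
        while parcel is not None:
            yield run_parcel(parcel)
            parcel = yield supervisor.callRemote('done_work', parcel)


    def main(supervisor_host):
        factory = pb.PBClientFactory()
        reactor.connectTCP(supervisor_host, TEST_NODE_PORT, factory)
        d = factory.getRootObject()
        d.addCallback(work_loop)
        d.addBoth(lambda ignored: reactor.stop())
        reactor.run()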

Currently there is no support for submitting to PQM after a test run,
and the email address is hard-wired to <email address hidden>. I
think I might fix the latter before I land this branch :)

Anyway, the daemon has three services: controller, subunit results
aggregator, and inspection. The first two are needed to run the
testing service; the last helps during development and debugging.
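
As an aside on debugging: besides the SSH inspection service, the
branch adds utilities/pb-shell, which wraps Perspective Broker calls
in a blocking API so that a node's controller can be poked at by hand.
From its prompt, a session might look like this (the hostname is made
up):

    >>> node = connect('ec2-203-0-113-5.compute-1.amazonaws.com')
    >>> node.become_supervisor(4)
    >>> parcel = node.get_work()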

=== Controller ===

The main class here is TestNode. It currently mixes together several
responsibilities that should be split out. It contains the logic for
being a supervisor (including starting the aggregator service and
sending report emails), being a worker, and shutting down. It is also,
in Perspective Broker parlance, a Referenceable object, which means
its remote_* methods can be called remotely.

Ideally I would like to separate out the concepts of the branch, the
supervisor and the worker, and then have a Referenceable object act as
the remote control for them.
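
To illustrate the Perspective Broker pattern being referred to, here
is a toy example (not TestNode): an object whose remote_* methods
clients can invoke with callRemote once they have fetched the
factory's root object.

    from twisted.internet import reactor
    from twisted.spread import pb


    class Echo(pb.Root):
        """Clients call root.callRemote('echo', msg) to reach this."""

        def remote_echo(self, message):
            return message


    if __name__ == '__main__':
        reactor.listenTCP(8789, pb.PBServerFactory(Echo()))
        reactor.run()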

=== Subunit Aggregator ===

This is simply a TCP service that receives line-based output from a
remotely running bin/test process that has been asked to produce
subunit output (via the --subunit flag). For each connection, it
creates a subunit.TestProtocolServer instance, to which it blindly
passes the input. Each TestProtocolServer is given a results object
that the supervisor has created, so that all test results from all
workers are recorded into a single test result object.

The test result object also calls back to the supervisor for every
test that is run, so that the supervisor can keep track of the tests
for which a result is available. It would be bad to submit to PQM
when, say, 1000 tests had never actually run.
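
Here is a condensed, hedged sketch of that arrangement, built from the
same pieces as remotenode.py (a Twisted line receiver feeding
subunit.TestProtocolServer, everything sharing one result object), but
using a plain unittest-style result rather than trial's
TimingTextReporter. The test id at the bottom is made up.

    import unittest

    import subunit
    from twisted.internet import reactor
    from twisted.internet.protocol import ServerFactory
    from twisted.protocols.basic import LineOnlyReceiver


    class AccountingResult(unittest.TestResult):
        """Reports each test id for which a result arrives."""

        def __init__(self, on_test_done):
            unittest.TestResult.__init__(self)
            self._on_test_done = on_test_done

        def stopTest(self, test):
            self._on_test_done(test.id())
            unittest.TestResult.stopTest(self, test)


    class AggregatingProtocol(LineOnlyReceiver):

        delimiter = '\n'

        def connectionMade(self):
            # One subunit parser per worker connection, all of them
            # feeding the single shared result.
            self._parser = subunit.TestProtocolServer(self.factory.result)

        def lineReceived(self, line):
            # TestProtocolServer expects the line ending to be present.
            self._parser.lineReceived(line + self.delimiter)


    class AggregatingFactory(ServerFactory):

        protocol = AggregatingProtocol

        def __init__(self, result):
            self.result = result


    if __name__ == '__main__':
        tests_to_run = set(['lp.example.test_something'])  # made up
        result = AccountingResult(tests_to_run.discard)
        reactor.listenTCP(8790, AggregatingFactory(result))
        reactor.run()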

=== Inspection se...


lp:~allenap/launchpad/ec2-parry updated
9824. By Gavin Panella

Merge devel.

9825. By Gavin Panella

time.time() is UTC already.

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

Hi Gavin,

This is about a quarter of a review -- sorry time crunch at the sprint
hasn't let me do anything more :(

Excited to see this progressing naturally, a little concerned about
the complexity...

> === modified file 'lib/devscripts/ec2test/account.py'
> --- lib/devscripts/ec2test/account.py 2009-10-08 14:50:19 +0000
> +++ lib/devscripts/ec2test/account.py 2009-11-12 08:29:26 +0000
> @@ -104,6 +104,9 @@
> security_group.authorize('tcp', 22, 22, '%s/32' % ip)
> security_group.authorize('tcp', 80, 80, '%s/32' % ip)
> security_group.authorize('tcp', 443, 443, '%s/32' % ip)
> + # Authorize Perspective Broker and Subunit. XXX: This is way
> + # too permissive.

Er, I'd say so yes.

Given that what I think you _actually_ want is ec2 instances to be
able to connect to each other on this protocol, not external machines,
you don't want to do the authorization like this, but rather use some
other EC2 feature I currently can't remember the name of :-) Read the
security group docs I guess.

> + security_group.authorize('tcp', 8789, 8790, '0.0.0.0/0')
> for network in demo_networks:
> # Add missing netmask info for single ips.
> if '/' not in network:

> === modified file 'lib/devscripts/ec2test/builtins.py'
> --- lib/devscripts/ec2test/builtins.py 2009-10-16 11:07:37 +0000
> +++ lib/devscripts/ec2test/builtins.py 2009-11-12 08:29:26 +0000
> @@ -306,6 +306,112 @@
> instance.set_up_and_run(postmortem, not headless, runner.run_tests)
>
>
> +class cmd_parallel_test(EC2Command):
> + """Run the test suite in ec2 in parallel."""
> +
> + takes_options = [
> + branch_option,
> + trunk_option,
> + machine_id_option,
> + instance_type_option,
> + postmortem_option,
> + include_download_cache_changes_option,
> + Option(
> + 'jobs', short_name='j', type=int, argname="NUM",
> + help=('The number of instances to start and distribute the '
> + 'test command across.')),
> + ]
> +
> + takes_args = ['test_branch?']
> +
> + def run(self, test_branch=None, branch=None, trunk=False, machine=None,
> + instance_type=DEFAULT_INSTANCE_TYPE, postmortem=False,
> + include_download_cache_changes=False, jobs=1):
> + if branch is None:
> + branch = []
> + branches, test_branch = _get_branches_and_test_branch(
> + trunk, branch, test_branch)
> +
> + if jobs < 1:
> + raise BzrCommandError(
> + 'The number of instances must be greater than zero.')
> +
> + from twisted.internet import reactor
> + from twisted.internet.defer import DeferredList, DeferredLock
> + from twisted.internet.threads import deferToThread
> +
> + from devscripts.ec2test.remotenode import connect_to_node

I don't think there is any reason these imports can't be at the module
level.

> + # Keep a record of workers as they start.
> + workers = []
> +
> + # This is fired once the supervisor has started.
> + supervisor_startup = DeferredLock()
> +
> + def show_error(f...

review: Abstain
lp:~allenap/launchpad/ec2-parry updated
9826. By Gavin Panella

Move imports to module level.

9827. By Gavin Panella

Use DeferredLock.run() where possible.

9828. By Gavin Panella

Return the deferred for starting the worker.

9829. By Gavin Panella

Just prepend LC_ALL=C to the command.

9830. By Gavin Panella

Only configure email when it's needed.

Revision history for this message
Gavin Panella (allenap) wrote :

On Thu, 12 Nov 2009 09:19:31 -0000
Michael Hudson <email address hidden> wrote:

> Review: Abstain
> Hi Gavin,
>
> This is about a quarter of a review -- sorry time crunch at the sprint
> hasn't let me do anything more :(
>
> Excited to see this progressing naturally, a little concerned about
> the complexity...

Cool :)

Complexity is bad, but I know it's there. This branch probably
represents a mid-point in my understanding of, and approach to, the
problem. In my head, some of this can already be removed or refactored
down to something simpler, but that's for a later branch. And maybe
someone else will do it.

>
> > === modified file 'lib/devscripts/ec2test/account.py'
> > --- lib/devscripts/ec2test/account.py 2009-10-08 14:50:19 +0000
> > +++ lib/devscripts/ec2test/account.py 2009-11-12 08:29:26 +0000
> > @@ -104,6 +104,9 @@
> > security_group.authorize('tcp', 22, 22, '%s/32' % ip)
> > security_group.authorize('tcp', 80, 80, '%s/32' % ip)
> > security_group.authorize('tcp', 443, 443, '%s/32' % ip)
> > + # Authorize Perspective Broker and Subunit. XXX: This is way
> > + # too permissive.
>
> Er, I'd say so yes.
>
> Given that what I think you _actually_ want is ec2 instances to be
> able to connect to each other on this protocol, not external machines,
> you don't want to do the authorization like this, but rather use some
> other EC2 feature I currently can't remember the name of :-) Read the
> security group docs I guess.

Yes, I agree totally. This was a hack to get things working. I think
it's possible to grant one security group access to another, but I've
not tried it. Also, machines started together as part of a reservation
or with the same security group might have different properties.
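
An untested sketch of that, assuming boto's SecurityGroup.authorize()
accepts a src_group argument, with security_group as in account.py:

    # Let instances in this security group talk to each other (which
    # covers the PB and subunit ports) instead of opening those ports
    # to 0.0.0.0/0. Coarser than a per-port rule, but not world-open.
    security_group.authorize(src_group=security_group)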

Currently the wrappers in devscripts/ec2test make it difficult to
start up multiple instances as part of a reservation, and security
groups are not exposed in the wrappers IIRC, but that's just a matter
of code.

Anyway, this is definitely something I mean to fix. You've probably
seen in the rest of the code that there is quite a lot that could be
improved, simplified, or removed. But I'm keen to get this landed so
that other masochists can dogfood it.

>
> > + security_group.authorize('tcp', 8789, 8790, '0.0.0.0/0')
> > for network in demo_networks:
> > # Add missing netmask info for single ips.
> > if '/' not in network:
>
> > === modified file 'lib/devscripts/ec2test/builtins.py'
> > --- lib/devscripts/ec2test/builtins.py 2009-10-16 11:07:37 +0000
> > +++ lib/devscripts/ec2test/builtins.py 2009-11-12 08:29:26 +0000
> > @@ -306,6 +306,112 @@
> > instance.set_up_and_run(postmortem, not headless, runner.run_tests)
> >
> >
> > +class cmd_parallel_test(EC2Command):
> > + """Run the test suite in ec2 in parallel."""
> > +
> > + takes_options = [
> > + branch_option,
> > + trunk_option,
> > + machine_id_option,
> > + instance_type_option,
> > + postmortem_option,
> > + include_download_cache_changes_option,
> > + Option(
> > + 'jobs', short_name='j', type=int, argname="NUM",
> > + help=(...

lp:~allenap/launchpad/ec2-parry updated
9831. By Gavin Panella

Don't fail if the .bazaar directory already exists.

9832. By Gavin Panella

Prepending LC_ALL=C to commands either didn't have a full effect, or broke things. Reverting.

9833. By Gavin Panella

Merge devel.

9834. By Gavin Panella

twistd on Hardy does not recognize --umask.

9835. By Gavin Panella

DeferredLock.run() does not pass any args.

9836. By Gavin Panella

Create test parcels in order to reduce layer churn (also because the test suite seems to break a *lot* when run out of order). Keep track of new tests, so that _add_tests() can be called more than once (not that it needs to be, but this behaviour will be useful later on for retrying unrunnable tests), and keep a record of all tests. Remove some other noise from the test list obtained by running bin/test --list.

9837. By Gavin Panella

Rename self._tests to self._tests_to_run.

9838. By Gavin Panella

Merge devel.

9839. By Gavin Panella

Merge devel.

9840. By Gavin Panella

Merge devel.

Revision history for this message
Gavin Panella (allenap) wrote :

The code has moved on so much from this merge proposal, and has moved into a separate branch, so I'm going to reject it.

Unmerged revisions

9841. By Gavin Panella

Merge devel, resolving several conflicts.

9840. By Gavin Panella

Merge devel.

9839. By Gavin Panella

Merge devel.

9838. By Gavin Panella

Merge devel.

9837. By Gavin Panella

Rename self._tests to self._tests_to_run.

9836. By Gavin Panella

Create test parcels in order to reduce layer churn (also because the test suite seems to break a *lot* when run out of order). Keep track of new tests, so that _add_tests() can be called more than once (not that it needs to be, but this behaviour will be useful later on for retrying unrunnable tests), and keep a record of all tests. Remove some other noise from the test list obtained by running bin/test --list.

9835. By Gavin Panella

DeferredLock.run() does not pass any args.

9834. By Gavin Panella

twistd on Hardy does not recognize --umask.

9833. By Gavin Panella

Merge devel.

9832. By Gavin Panella

Prepending LC_ALL=C to commands either didn't have a full effect, or broke things. Reverting.

Preview Diff

1=== modified file 'lib/devscripts/ec2test/account.py'
2--- lib/devscripts/ec2test/account.py 2009-10-08 14:50:19 +0000
3+++ lib/devscripts/ec2test/account.py 2009-12-23 15:23:24 +0000
4@@ -104,6 +104,9 @@
5 security_group.authorize('tcp', 22, 22, '%s/32' % ip)
6 security_group.authorize('tcp', 80, 80, '%s/32' % ip)
7 security_group.authorize('tcp', 443, 443, '%s/32' % ip)
8+ # Authorize Perspective Broker and Subunit. XXX: This is way
9+ # too permissive.
10+ security_group.authorize('tcp', 8789, 8790, '0.0.0.0/0')
11 for network in demo_networks:
12 # Add missing netmask info for single ips.
13 if '/' not in network:
14
15=== modified file 'lib/devscripts/ec2test/builtins.py'
16--- lib/devscripts/ec2test/builtins.py 2009-11-30 16:26:24 +0000
17+++ lib/devscripts/ec2test/builtins.py 2009-12-23 15:23:24 +0000
18@@ -18,11 +18,16 @@
19
20 import socket
21
22+from twisted.internet import reactor
23+from twisted.internet.defer import DeferredList, DeferredLock
24+from twisted.internet.threads import deferToThread
25+
26 from devscripts import get_launchpad_root
27
28 from devscripts.ec2test.credentials import EC2Credentials
29 from devscripts.ec2test.instance import (
30 AVAILABLE_INSTANCE_TYPES, DEFAULT_INSTANCE_TYPE, EC2Instance)
31+from devscripts.ec2test.remotenode import connect_to_node
32 from devscripts.ec2test.session import EC2SessionName
33 from devscripts.ec2test.testrunner import EC2TestRunner, TRUNK_BRANCH
34
35@@ -306,6 +311,101 @@
36 instance.set_up_and_run(postmortem, not headless, runner.run_tests)
37
38
39+class cmd_parallel_test(EC2Command):
40+ """Run the test suite in ec2 in parallel."""
41+
42+ takes_options = [
43+ branch_option,
44+ trunk_option,
45+ machine_id_option,
46+ instance_type_option,
47+ postmortem_option,
48+ include_download_cache_changes_option,
49+ Option(
50+ 'jobs', short_name='j', type=int, argname="NUM",
51+ help=('The number of instances to start and distribute the '
52+ 'test command across.')),
53+ ]
54+
55+ takes_args = ['test_branch?']
56+
57+ def run(self, test_branch=None, branch=None, trunk=False, machine=None,
58+ instance_type=DEFAULT_INSTANCE_TYPE, postmortem=False,
59+ include_download_cache_changes=False, jobs=1):
60+ if branch is None:
61+ branch = []
62+ branches, test_branch = _get_branches_and_test_branch(
63+ trunk, branch, test_branch)
64+
65+ if jobs < 1:
66+ raise BzrCommandError(
67+ 'The number of instances must be greater than zero.')
68+
69+ # Keep a record of workers as they start.
70+ workers = []
71+
72+ # This is fired once the supervisor has started.
73+ supervisor_startup = DeferredLock()
74+
75+ def show_error(failure):
76+ failure.printTraceback()
77+
78+ def start_instance_and_node(instance, runner):
79+ # Bring up the instance and start the test node service.
80+ try:
81+ instance.start()
82+ runner.run_node()
83+ except:
84+ instance.shutdown()
85+ raise
86+ else:
87+ return instance
88+
89+ def get_root(instance):
90+ return connect_to_node(instance.hostname)
91+
92+ def start_tests(root, instance):
93+ workers.append((root, instance))
94+ if len(workers) == 1:
95+ # Start the supervisor.
96+ supervisor_startup.run(
97+ root.callRemote, 'become_supervisor', jobs)
98+ # Start the worker.
99+ supervisor_root, supervisor_instance = workers[0]
100+ # Wait for the supervisor to start.
101+ d = supervisor_startup.run(lambda: None)
102+ # Call the node *outside* of the lock.
103+ return d.addCallback(lambda _: root.callRemote(
104+ 'got_supervisor', supervisor_instance.hostname))
105+
106+ def create_and_start_instance():
107+ session_name = EC2SessionName.make(EC2TestRunner.name)
108+ instance = EC2Instance.make(
109+ session_name, instance_type, machine)
110+ runner = EC2TestRunner(
111+ test_branch, branches=branches,
112+ include_download_cache_changes=include_download_cache_changes,
113+ instance=instance, launchpad_login=instance._launchpad_login)
114+ # Do the startup in a thread because boto is
115+ # blocking. XXX: Use txAWS here instead?
116+ startup = deferToThread(start_instance_and_node, instance, runner)
117+ startup.addCallback(get_root)
118+ startup.addCallback(start_tests, instance)
119+ return instance, startup
120+
121+ startups = []
122+ for job in range(jobs):
123+ instance, startup = create_and_start_instance()
124+ startup.addErrback(show_error)
125+ startups.append(startup)
126+
127+ # Stop when all the instances have been started.
128+ started = DeferredList(startups)
129+ started.addBoth(lambda _: reactor.stop())
130+
131+ reactor.run()
132+
133+
134 class cmd_land(EC2Command):
135 """Land a merge proposal on Launchpad."""
136
137
138=== modified file 'lib/devscripts/ec2test/instance.py'
139--- lib/devscripts/ec2test/instance.py 2009-11-27 07:24:49 +0000
140+++ lib/devscripts/ec2test/instance.py 2009-12-23 15:23:24 +0000
141@@ -404,8 +404,6 @@
142 def set_up_and_run(self, postmortem, shutdown, func, *args, **kw):
143 """Start, run `func` and then maybe shut down.
144
145- :param config: A dictionary specifying details of how the instance
146- should be run:
147 :param postmortem: If true, any exceptions will be caught and an
148 interactive session run to allow debugging the problem.
149 :param shutdown: If true, shut down the instance after `func` and
150@@ -574,6 +572,11 @@
151 self._ssh = ssh
152 self._sftp = None
153
154+ def _command_with_locale(self, cmd):
155+ # Default the locale to C to stop locale warnings (especially
156+ # from bzr) from clogging up the output.
157+ return 'export LC_ALL=C; ' + cmd
158+
159 @property
160 def sftp(self):
161 if self._sftp is None:
162@@ -588,6 +591,7 @@
163 :param out: A stream to write the output of the remote command to.
164 :param err: A stream to write the error of the remote command to.
165 """
166+ cmd = self._command_with_locale(cmd)
167 if out is None:
168 out = sys.stdout
169 if err is None:
170@@ -623,12 +627,17 @@
171 raise RuntimeError('Command failed: %s' % (cmd,))
172 return res
173
174+ def run_as_daemon(self, cmd):
175+ """Start `cmd` as a daemonized process on the server."""
176+ return self.perform('(%s </dev/null >/dev/null 2>/dev/null &)' % cmd)
177+
178 def run_with_ssh_agent(self, cmd, ignore_failure=False):
179 """Run 'cmd' in a subprocess.
180
181 Use this to run commands that require local SSH credentials. For
182 example, getting private branches from Launchpad.
183 """
184+ cmd = self._command_with_locale(cmd)
185 self._instance.log(
186 '%s@%s$ %s\n'
187 % (self._username, self._instance._boto_instance.id, cmd))
188
189=== renamed file 'lib/devscripts/ec2test/ec2test-remote.py' => 'lib/devscripts/ec2test/remote.py'
190--- lib/devscripts/ec2test/ec2test-remote.py 2009-10-09 15:04:24 +0000
191+++ lib/devscripts/ec2test/remote.py 2009-12-23 15:23:24 +0000
192@@ -30,7 +30,8 @@
193 class BaseTestRunner:
194
195 def __init__(self, email=None, pqm_message=None, public_branch=None,
196- public_branch_revno=None, test_options=None):
197+ public_branch_revno=None, test_options=None,
198+ prefix=os.path.curdir):
199 self.email = email
200 self.pqm_message = pqm_message
201 self.public_branch = public_branch
202@@ -42,7 +43,8 @@
203 self.test_options = test_options
204
205 # Configure paths.
206- self.lp_dir = os.path.join(os.path.sep, 'var', 'launchpad')
207+ self.prefix_dir = os.path.abspath(prefix)
208+ self.lp_dir = os.path.join(self.prefix_dir, 'launchpad')
209 self.tmp_dir = os.path.join(self.lp_dir, 'tmp')
210 self.test_dir = os.path.join(self.lp_dir, 'test')
211 self.sourcecode_dir = os.path.join(self.test_dir, 'sourcecode')
212@@ -52,8 +54,8 @@
213 self.test_dir,
214 self.public_branch,
215 self.public_branch_revno,
216- self.sourcecode_dir
217- )
218+ self.sourcecode_dir,
219+ os.path.join(self.prefix_dir, 'www'))
220
221 # Daemonization options.
222 self.pid_filename = os.path.join(self.lp_dir, 'ec2test-remote.pid')
223@@ -88,7 +90,41 @@
224 """
225 raise NotImplementedError
226
227- def test(self):
228+ def run_tests(self):
229+ """Run the tests the good old fashioned way.
230+
231+ :return: A boolean indicating success.
232+ """
233+ call = self.build_test_command()
234+
235+ popen = subprocess.Popen(
236+ call, bufsize=-1,
237+ stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
238+ cwd=self.test_dir)
239+
240+ self._gather_test_output(
241+ popen, self.logger.summary_file, self.logger.out_file)
242+
243+ # Grab the testrunner exit status
244+ result = popen.wait()
245+
246+ if self.pqm_message is not None:
247+ subject = self.pqm_message.get('Subject')
248+ if result:
249+ # failure
250+ self.logger.summary_file.write(
251+ '\n\n**NOT** submitted to PQM:\n%s\n' % (subject,))
252+ else:
253+ # success
254+ conn = bzrlib.smtp_connection.SMTPConnection(
255+ bzrlib.config.GlobalConfig())
256+ conn.send_email(self.pqm_message)
257+ self.logger.summary_file.write(
258+ '\n\nSUBMITTED TO PQM:\n%s\n' % (subject,))
259+
260+ return (result == 0)
261+
262+ def run(self):
263 """Run the tests, log the results.
264
265 Signals the ec2test process and cleans up the logs once all the tests
266@@ -99,43 +135,17 @@
267 # os.fork() may have tried to close them.
268 self.logger.prepare()
269
270- out_file = self.logger.out_file
271+ out_file = self.logger.out_file
272 summary_file = self.logger.summary_file
273- config = bzrlib.config.GlobalConfig()
274-
275- call = self.build_test_command()
276
277 try:
278+ success = False
279 try:
280 try:
281- popen = subprocess.Popen(
282- call, bufsize=-1,
283- stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
284- cwd=self.test_dir)
285-
286- self._gather_test_output(popen, summary_file, out_file)
287-
288- # Grab the testrunner exit status
289- result = popen.wait()
290-
291- if self.pqm_message is not None:
292- subject = self.pqm_message.get('Subject')
293- if result:
294- # failure
295- summary_file.write(
296- '\n\n**NOT** submitted to PQM:\n%s\n' %
297- (subject,))
298- else:
299- # success
300- conn = bzrlib.smtp_connection.SMTPConnection(
301- config)
302- conn.send_email(self.pqm_message)
303- summary_file.write('\n\nSUBMITTED TO PQM:\n%s\n' %
304- (subject,))
305+ success = self.run_tests()
306 except:
307 summary_file.write('\n\nERROR IN TESTRUNNER\n\n')
308 traceback.print_exc(file=summary_file)
309- result = 1
310 raise
311 finally:
312 # It probably isn't safe to close the log files ourselves,
313@@ -143,10 +153,11 @@
314 summary_file.close()
315 if self.email is not None:
316 subject = 'Test results: %s' % (
317- result and 'FAILURE' or 'SUCCESS')
318+ success and 'SUCCESS' or 'FAILURE')
319 summary_file = open(self.logger.summary_filename, 'r')
320 bzrlib.email_message.EmailMessage.send(
321- config, self.email[0], self.email,
322+ bzrlib.config.GlobalConfig(),
323+ self.email[0], self.email,
324 subject, summary_file.read())
325 summary_file.close()
326 finally:
327@@ -187,8 +198,7 @@
328
329 def build_test_command(self):
330 """See BaseTestRunner.build_test_command()."""
331- command = ['make', 'check', 'VERBOSITY=' + self.test_options]
332- return command
333+ return ['make', 'check', 'VERBOSITY=' + self.test_options]
334
335 # Used to filter lines in the summary log. See
336 # `BaseTestRunner.ignore_line()`.
337@@ -211,23 +221,22 @@
338 # display, and return the exit code. (See the xvfb-run man page for
339 # details.)
340 return [
341- 'xvfb-run',
342- '-s', '-screen 0 1024x768x24',
343- 'make', 'jscheck']
344+ 'xvfb-run', '-s', '-screen 0 1024x768x24',
345+ 'make', 'jscheck', 'VERBOSITY=' + self.test_options]
346
347
348 class WebTestLogger:
349 """Logs test output to disk and a simple web page."""
350
351 def __init__(self, test_dir, public_branch, public_branch_revno,
352- sourcecode_dir):
353+ sourcecode_dir, www_dir):
354 """ Class initialiser """
355 self.test_dir = test_dir
356 self.public_branch = public_branch
357 self.public_branch_revno = public_branch_revno
358 self.sourcecode_dir = sourcecode_dir
359+ self.www_dir = www_dir
360
361- self.www_dir = os.path.join(os.path.sep, 'var', 'www')
362 self.out_filename = os.path.join(self.www_dir, 'current_test.log')
363 self.summary_filename = os.path.join(self.www_dir, 'summary.log')
364 self.index_filename = os.path.join(self.www_dir, 'index.html')
365@@ -265,9 +274,9 @@
366 """
367 self.open_logs()
368
369- out_file = self.out_file
370+ out_file = self.out_file
371 summary_file = self.summary_file
372- index_file = self.index_file
373+ index_file = self.index_file
374
375 def write(msg):
376 msg += '\n'
377@@ -432,11 +441,16 @@
378 parser.add_option(
379 '--jscheck', dest='jscheck', default=False, action='store_true',
380 help=('Run the JavaScript integration test suite.'))
381+ parser.add_option(
382+ '--prefix', dest='prefix', default=os.path.join(os.path.sep, 'var'),
383+ help=('The directory in which to start the runner, '
384+ '%default by default.'))
385
386 options, args = parser.parse_args()
387
388 if options.debug:
389- import pdb; pdb.set_trace()
390+ import pdb
391+ pdb.set_trace()
392 if options.pqm_message is not None:
393 pqm_message = pickle.loads(
394 options.pqm_message.decode('string-escape').decode('base64'))
395@@ -450,20 +464,19 @@
396 runner_type = TestOnMergeRunner
397
398 runner = runner_type(
399- options.email,
400- pqm_message,
401- options.public_branch,
402- options.public_branch_revno,
403- ' '.join(args)
404- )
405+ email=options.email,
406+ pqm_message=pqm_message,
407+ public_branch=options.public_branch,
408+ public_branch_revno=options.public_branch_revno,
409+ test_options=' '.join(args),
410+ prefix=options.prefix)
411
412 try:
413 try:
414 if options.daemon:
415 print 'Starting testrunner daemon...'
416 runner.daemonize()
417-
418- runner.test()
419+ runner.run()
420 except:
421 # Handle exceptions thrown by the test() or daemonize() methods.
422 if options.email:
423
424=== added file 'lib/devscripts/ec2test/remotenode.py'
425--- lib/devscripts/ec2test/remotenode.py 1970-01-01 00:00:00 +0000
426+++ lib/devscripts/ec2test/remotenode.py 2009-12-23 15:23:24 +0000
427@@ -0,0 +1,539 @@
428+# Copyright 2009 Canonical Ltd. This software is licensed under the
429+# GNU Affero General Public License version 3 (see the file LICENSE).
430+
431+__metatype__ = type
432+
433+import os
434+import tempfile
435+import time
436+
437+from itertools import count, takewhile
438+
439+import bzrlib.config
440+import bzrlib.email_message
441+import bzrlib.smtp_connection
442+import subunit
443+
444+from twisted.application import service, internet
445+from twisted.cred import portal, checkers
446+from twisted.conch import manhole, manhole_ssh
447+from twisted.internet import defer
448+from twisted.internet import error
449+from twisted.internet import reactor
450+from twisted.internet.protocol import (
451+ ClientCreator, ProcessProtocol, Protocol, ServerFactory, connectionDone)
452+from twisted.internet.threads import deferToThread
453+from twisted.internet.utils import getProcessOutputAndValue
454+from twisted.protocols.basic import LineOnlyReceiver
455+from twisted.spread import pb
456+from twisted.trial.reporter import TimingTextReporter
457+
458+
459+# The port that nodes in the cluster listen on for PB commands.
460+TEST_NODE_PORT = 8789
461+
462+# The port that the supervisor node listens on to receive test result data
463+# from the workers.
464+# XXX: Make this a property of the factory?
465+TEST_STREAMING_PORT = 8790
466+
467+# SSH port for inspecting the running node.
468+NODE_INSPECTION_PORT = 8791
469+
470+# For throwing stuff away.
471+DEVNULL = open(os.devnull, 'rwb+')
472+
473+
474+def calculate_parcel_size(total, num_parcels):
475+ """Calculate the number of elements that should go into a parcel.
476+
477+ calculate_parcel_size(total, num_parcels) * num_parcels >= total
478+ """
479+ parcel_size = total // num_parcels
480+ while parcel_size * num_parcels < total:
481+ parcel_size += 1
482+ return parcel_size
483+
484+
485+def save_list(iterable, delimiter='\n'):
486+ """Save an iterable of strings to a file, newline-separated."""
487+ fd, filename = tempfile.mkstemp()
488+ f = os.fdopen(fd, 'wb')
489+ for line in iterable:
490+ f.write(line + delimiter)
491+ f.close()
492+ return filename
493+
494+
495+class CombiningSubunitFactory(ServerFactory):
496+ """A factory for listening to subunit streams and collating results."""
497+
498+ def __init__(self, result, logger):
499+ self._result = result
500+ self._logger = logger
501+
502+ def buildProtocol(self, addr):
503+ return SubunitProtocol(self._result, self._logger)
504+
505+
506+class SubunitProtocolLogger:
507+ """Organize logging for the SubunitProtocol."""
508+
509+ def __init__(self, log_dir):
510+ self._log_dir = log_dir
511+ self._log_nums = count(1)
512+ self._logs = []
513+
514+ def open_log(self):
515+ log_index = '%02d' % self._log_nums.next()
516+ log_filename = os.path.join(
517+ self._log_dir, 'subunit%s.log' % log_index)
518+ log = open(log_filename, 'wb', buffering=1)
519+ self._logs.append(log_filename)
520+ return log
521+
522+ def close_log(self, log):
523+ log.close()
524+
525+ @property
526+ def logs(self):
527+ return iter(self._logs)
528+
529+
530+class SubunitProtocol(LineOnlyReceiver):
531+ """An implementation of subunit in Twisted, yay!"""
532+
533+ # subunit separates lines with \n.
534+ delimiter = '\n'
535+
536+ def __init__(self, result, logger):
537+ self._result = result
538+ self._logger = logger
539+
540+ def connectionMade(self):
541+ self._subunit_log = self._logger.open_log()
542+ self._subunit = subunit.TestProtocolServer(self._result, DEVNULL)
543+
544+ def lineReceived(self, line):
545+ # subunit.TestProtocolServer expects the line-ending to be
546+ # present.
547+ line = line + self.delimiter
548+ self._subunit_log.write(line)
549+ self._subunit.lineReceived(line)
550+
551+ def connectionLost(self, reason=connectionDone):
552+ self._subunit.lostConnection()
553+ del self._subunit
554+ self._logger.close_log(self._subunit_log)
555+ del self._subunit_log
556+
557+
558+class TestProcessProtocol(ProcessProtocol):
559+
560+ # XXX: TestProcessProtocol isn't a great name for this. It's really an
561+ # adapter from a line receiver to a process protocol, plus a deferred that
562+ # fires on process end. Doesn't have anything to do with tests really.
563+
564+ # XXX: Timeouts!
565+
566+ def __init__(self, deferred, remote_transport):
567+ self._fire_when_done = deferred
568+ self._remote_transport = remote_transport
569+
570+ def connectionMade(self):
571+ print "Test process started."
572+ self._checkpoint = time.time()
573+
574+ def outReceived(self, data):
575+ self._remote_transport.write(data)
576+ # Say something periodically to prevent watchdogs from killing us.
577+ if time.time() - self._checkpoint > 60:
578+ print "Test process running."
579+ self._checkpoint = time.time()
580+
581+ def processEnded(self, reason):
582+ if self._fire_when_done is not None:
583+ d, self._fire_when_done = self._fire_when_done, None
584+ if reason.check(error.ProcessDone):
585+ d.callback(None)
586+ else:
587+ d.callback(reason)
588+ self._remote_transport.loseConnection()
589+
590+
591+class AccountingTestResult(TimingTextReporter):
592+
593+ def __init__(self, callback, stream):
594+ self._callback = callback
595+ super(AccountingTestResult, self).__init__(stream)
596+
597+ def stopTest(self, test):
598+ self._callback(test)
599+ super(AccountingTestResult, self).stopTest(test)
600+
601+
602+def connect_to_node(address, port=TEST_NODE_PORT, reactor=reactor):
603+ """Get the PB node at 'address'."""
604+ factory = pb.PBClientFactory()
605+ reactor.connectTCP(address, port, factory)
606+ return factory.getRootObject()
607+
608+
609+def connect_for_writing(reactor, host, port):
610+ d = ClientCreator(reactor, Protocol).connectTCP(host, port)
611+ return d.addCallback(lambda protocol: protocol.transport)
612+
613+
614+def run_process(executable, args=(), path=None, reactor=None):
615+ """Run a process and raise an error if it fails."""
616+ print 'Running', executable, args
617+
618+ d = getProcessOutputAndValue(
619+ executable, args=args, env=os.environ,
620+ path=path, reactor=reactor)
621+
622+ def check_result((out, err, code)):
623+ print 'Done'
624+ if code != 0:
625+ # XXX: Is this the best exception to raise?
626+ raise RuntimeError(
627+ "%s failed with code %d" % (executable, code), out, err)
628+ return out
629+
630+ return d.addCallback(check_result)
631+
632+
633+class TreeBuilder:
634+ """Builds the tree once, guarded by a lock."""
635+
636+ # XXX: This could easily be made more general, i.e. ExecuteOnce. In fact,
637+ # there probably is already something similar in Twisted.
638+
639+ def __init__(self, build_dir):
640+ self._lock = defer.DeferredLock()
641+ self._build_dir = build_dir
642+ self._build_result = None
643+
644+ def _maybe_build(self):
645+ if self._build_result is None:
646+ build = run_process('/usr/bin/make', path=self._build_dir)
647+ def built(result):
648+ self._build_result = result
649+ return build.addBoth(built)
650+ else:
651+ return self._build_result
652+
653+ def build(self):
654+ return self._lock.run(self._maybe_build)
655+
656+
657+class CloseableDeferredQueue(defer.DeferredQueue):
658+ """A `DeferredQueue` that can be closed.
659+
660+ Once it has been closed by `queue.close()`, and once there are no
661+ pending items in the queue, `queue.get()` will always return a
662+ Deferred that has already been fired with None.
663+ """
664+
665+ def __init__(self, size=None, backlog=None):
666+ super(CloseableDeferredQueue, self).__init__(size, backlog)
667+ self.closed = False
668+
669+ def close(self):
670+ """Close this queue, and fire None to any waiting processes."""
671+ self.closed = True
672+ while self.waiting:
673+ self.waiting.pop().callback(None)
674+
675+ def get(self):
676+ d = super(CloseableDeferredQueue, self).get()
677+ if self.closed and not d.called:
678+ d.callback(None)
679+ return d
680+
681+
682+class TestNode(pb.Root):
683+ """A node of a cluster that runs the Launchpad test suite."""
684+
685+ def __init__(self, service, branch_dir, log_dir):
686+ self._service = service
687+ self._branch_dir = branch_dir
688+ self._log_dir = log_dir
689+ self._is_supervisor = False
690+ self._tree_builder = TreeBuilder(self._branch_dir)
691+
692+ def _run_process(self, executable, args=(), reactor=None):
693+ """Run a process and raise an error if it fails.
694+
695+ Always runs the process in the `branch_dir`.
696+ """
697+ return run_process(executable, args, self._branch_dir, reactor)
698+
699+ def _find_tests(self):
700+ """Return a list of tests that make up the Launchpad test suite.
701+
702+ :return: A Deferred that fires with a list of test ids.
703+ """
704+ d = self._tree_builder.build()
705+ test_process = os.path.join(self._branch_dir, 'bin', 'test')
706+ d.addCallback(
707+ lambda ignored: self._run_process(test_process, ['--list']))
708+ # ./bin/test --list dumps out all of the test ids with newline
709+ # separation, but it adds a Total: at the end. There can also
710+ # be import fascist warnings after that. Drop the total and
711+ # everything after it so we don't confuse our callers.
712+ p_testline = lambda line: not line.startswith('Total:')
713+ return d.addCallback(
714+ lambda output: list(
715+ takewhile(p_testline, output.splitlines())))
716+
717+ def _add_tests(self, tests, num_parcels):
718+ """Keep a record of the tests to be run.
719+
720+ The set of tests will be used to keep track of what tests have
721+ not yet had results. Also, using a set de-duplicates the list.
722+ """
723+ tests = set(tests)
724+ tests_new = tests - self._tests_all
725+ # Sort the tests to reduce layer setup/teardown churn. The
726+ # test suite also breaks a lot when run out of order.
727+ tests_new_sorted = sorted(tests_new)
728+ # Update state.
729+ self._tests_all.update(tests_new)
730+ self._tests_to_run.update(tests_new)
731+ # Split up new tests into parcels.
732+ parcel_size = calculate_parcel_size(len(tests), num_parcels)
733+ for index in xrange(0, len(tests_new_sorted), parcel_size):
734+ self._test_parcels.put(
735+ tests_new_sorted[index:index+parcel_size])
736+ # Tell everyone.
737+ print ("%s new tests, split into %d parcels, each of at "
738+ "most %d tests" % (len(tests_new), num_parcels, parcel_size))
739+
740+ def _calculate_work(self, num_parcels):
741+ print 'Calculating work'
742+ d = self._find_tests()
743+ d.addCallback(self._add_tests, num_parcels)
744+ return d
745+
746+ def remote_become_supervisor(self, num_parcels):
747+ """Tell this node that it is a supervisor.
748+
749+ Note that supervisor nodes can still be workers, i.e. they can still
750+ run tests.
751+
752+ :param num_parcels: The number of parcels in which to split the work
753+ up into. For optimum sharing - assuming each parcel takes a
754+ similar amount of time to process - this should be the same as the
755+ number of workers, or a multiple thereof.
756+ """
757+ if self._is_supervisor:
758+ return
759+ self._is_supervisor = True
760+ print 'Becoming supervisor'
761+ # All tests.
762+ self._tests_all = set()
763+ # Tests that have not yet been run or attempted.
764+ self._tests_to_run = set()
765+ # Tests that have been attempted but for which there is no result.
766+ self._tests_not_run = set()
767+ # Parcels of tests to run on one go, for efficiency.
768+ self._test_parcels = CloseableDeferredQueue()
769+ # The result object.
770+ self._test_result_log = os.path.join(self._log_dir, 'result.log')
771+ self._test_result = AccountingTestResult(
772+ callback=lambda test: self._tests_to_run.discard(test.id()),
773+ stream=open(self._test_result_log, 'wb', buffering=1))
774+ # Listen for subunit, pushing the results into the test
775+ # result, and logging everything as we go along.
776+ self._test_logger = SubunitProtocolLogger(self._log_dir)
777+ result_service = internet.TCPServer(
778+ TEST_STREAMING_PORT, CombiningSubunitFactory(
779+ self._test_result, self._test_logger))
780+ result_service.setServiceParent(self._service)
781+ # At this point, we have been told how many parcels of work there
782+ # should be, but we do not necessarily have any workers. Calculate
783+ # each parcel of work. Work will be scheduled to workers as they ask
784+ # for it.
785+ self._calculate_work(num_parcels)
786+
787+ def remote_get_work(self):
788+ """Called by workers when they are ready for work."""
789+ assert self._is_supervisor
790+ return self._test_parcels.get()
791+
792+ def remote_done_work(self, parcel):
793+ """Called by workers to declare that they have successfully
794+ completed a parcel of work.
795+
796+ This is used to check that we have at least attempted to run
797+ every test, and also to trigger shutdown of the node.
798+
799+ Returns another parcel of work if there is any.
800+ """
801+ assert self._is_supervisor
802+ self._tests_not_run.update(
803+ self._tests_to_run.intersection(parcel))
804+ self._tests_to_run.difference_update(self._tests_not_run)
805+ if len(self._tests_to_run) == 0:
806+ self._done_as_supervisor()
807+ print ("Work done; parcel had %d tests; "
808+ "%d tests remaining; %d cannot be run." % (
809+ len(parcel), len(self._tests_to_run),
810+ len(self._tests_not_run)))
811+ return self._test_parcels.get()
812+
813+ def _do_work(self, tests):
814+ # Connect to the reporting server.
815+ connected = connect_for_writing(
816+ reactor, self._supervisor_addr, TEST_STREAMING_PORT)
817+ def run_tests(transport):
818+ tests_finished = defer.Deferred()
819+ protocol = TestProcessProtocol(tests_finished, transport)
820+ command = (
821+ '/usr/bin/make', '-o', 'clean', '-o', 'build', 'check',
822+ 'VERBOSITY=--subunit --load-list %s' % save_list(tests))
823+ print "%s$ %s" % (self._branch_dir, command)
824+ reactor.spawnProcess(
825+ protocol, command[0], command,
826+ env=os.environ, path=self._branch_dir)
827+ return tests_finished
828+ connected.addCallback(run_tests)
829+ connected.addErrback(lambda failure: failure.printTraceback())
830+ connected.addBoth(
831+ lambda _: self._supervisor.callRemote('done_work', tests))
832+ connected.addCallback(self._maybe_do_work)
833+
834+ def _maybe_do_work(self, tests):
835+ if tests is None:
836+ return self._done_as_worker()
837+ else:
838+ return self._do_work(tests)
839+
840+ def remote_got_supervisor(self, supervisor_addr):
841+ """Called when the supervisor node is up and running."""
842+ print 'Got supervisor', supervisor_addr
843+ d = connect_to_node(supervisor_addr)
844+ def got_supervisor(supervisor):
845+ self._supervisor_addr = supervisor_addr
846+ self._supervisor = supervisor
847+ d.addCallback(got_supervisor)
848+ d.addCallback(lambda _: self._tree_builder.build())
849+ d.addCallback(lambda _: self._supervisor.callRemote('get_work'))
850+ d.addCallback(self._maybe_do_work)
851+
852+ def _send_summary_email(self):
853+ if self._test_result.wasSuccessful():
854+ if len(self._tests_to_run) > 0 or len(self._tests_not_run) > 0:
855+ subject = 'Test results: INCONCLUSIVE'
856+ else:
857+ subject = 'Test results: SUCCESS'
858+ else:
859+ subject = 'Tests results: FAILURE'
860+
861+ body = []
862+
863+ # Warn about tests that have not been run at all.
864+ if len(self._tests_to_run) > 0:
865+ body.append(
866+ "Warning: %d tests were not attempted. See untested.txt "
867+ "for the full list." % len(self._tests_to_run))
868+
869+ # Warn about tests that could not be run.
870+ if len(self._tests_not_run) > 0:
871+ body.append(
872+ "Warning: %d tests could not be run. See unrunnable.txt "
873+ "for the full list." % len(self._tests_not_run))
874+
875+ # Pad a bit.
876+ body.extend(['', '--', ''])
877+
878+ # XXX: Change the email address :)
879+ message = bzrlib.email_message.EmailMessage(
880+ "gavin.panella@canonical.com",
881+ ["gavin.panella@canonical.com"],
882+ subject, "\n".join(body))
883+
884+ # Attach the results log.
885+ message.add_inline_attachment(
886+ open(self._test_result_log, 'rb').read(),
887+ os.path.basename(self._test_result_log))
888+
889+ # Attach a list of tests that were not run.
890+ if len(self._tests_to_run) > 0:
891+ message.add_inline_attachment(
892+ "\n".join(sorted(self._tests_to_run)), "untested.txt")
893+
894+ # Attach a list of tests that could not be run.
895+ if len(self._tests_not_run) > 0:
896+ message.add_inline_attachment(
897+ "\n".join(sorted(self._tests_not_run)), "unrunnable.txt")
898+
899+ # Attach the subunit logs.
900+ for log_filename in self._test_logger.logs:
901+ message.add_inline_attachment(
902+ open(log_filename, 'rb').read(),
903+ os.path.basename(log_filename))
904+
905+ def send():
906+ config = bzrlib.config.GlobalConfig()
907+ connection = bzrlib.smtp_connection.SMTPConnection(config)
908+ connection.send_email(message)
909+ return deferToThread(send)
910+
911+ def _done_as_supervisor(self):
912+ d = self._send_summary_email()
913+ d.addErrback(lambda failure: failure.printTraceback())
914+ d.addBoth(lambda _: self._shutdown())
915+
916+ def _done_as_worker(self):
917+ if not self._is_supervisor:
918+ self._shutdown()
919+
920+ def _shutdown(self):
921+ # XXX: There's probably a better way of doing this, but this
922+ # seems to work reliably for now.
923+ return reactor.callLater(2, reactor.stop)
924+
925+
926+def make_root(service, node_dir):
927+ node_dir = os.path.abspath(node_dir)
928+ branch_dir = os.path.join(node_dir, 'launchpad', 'test')
929+ log_dir = os.path.join(node_dir, 'www')
930+ # XXX: Might be better for the node service to actually implement
931+ # IService.
932+ return TestNode(service, branch_dir, log_dir)
933+
934+
935+def make_node_service(node):
936+ return internet.TCPServer(TEST_NODE_PORT, pb.PBServerFactory(node))
937+
938+
939+def make_inspection_service(**namespace):
940+ def getManholeFactory(ns):
941+ realm = manhole_ssh.TerminalRealm()
942+ def getManhole(_):
943+ return manhole.ColoredManhole(ns)
944+ realm.chainedProtocolFactory.protocolFactory = getManhole
945+ p = portal.Portal(realm)
946+ p.registerChecker(
947+ checkers.InMemoryUsernamePasswordDatabaseDontUse(admin="admin"))
948+ return manhole_ssh.ConchFactory(p)
949+ return internet.TCPServer(
950+ NODE_INSPECTION_PORT, getManholeFactory(namespace))
951+
952+
953+# The Launchpad Test node application.
954+application = service.Application("Launchpad Test Node")
955+
956+# The root, controller. It's a bit of a muddle right now.
957+root = make_root(application, os.environ.get('NODE_DIR', '/var'))
958+
959+# The PB node service.
960+node_service = make_node_service(root)
961+node_service.setServiceParent(application)
962+
963+# The inspection service.
964+inspection_service = make_inspection_service(
965+ application=application, root=root)
966+inspection_service.setServiceParent(application)
967
968=== added file 'lib/devscripts/ec2test/remotenodekiller.py'
969--- lib/devscripts/ec2test/remotenodekiller.py 1970-01-01 00:00:00 +0000
970+++ lib/devscripts/ec2test/remotenodekiller.py 2009-12-23 15:23:24 +0000
971@@ -0,0 +1,146 @@
972+#!/usr/bin/python
973+
974+import optparse
975+import sys
976+import traceback
977+
978+from glob import glob
979+from os import kill
980+from os.path import exists, getmtime
981+from subprocess import call
982+from time import sleep, time
983+
984+
985+def read_pid_from_file(filename):
986+ """Read the pid from a file.
987+
988+ IO errors are allowed to propagate, but if the first line of the
989+ file cannot converted to an int then None is returned, thus
990+ permitting odd looking pid files.
991+ """
992+ fd = open(filename, 'rb')
993+ try:
994+ pid = fd.readline().strip()
995+ finally:
996+ fd.close()
997+ try:
998+ return int(pid)
999+ except ValueError:
1000+ return None
1001+
1002+
1003+def watch(wait, pids, pid_files, log_file_patterns, expiry_age):
1004+ """Watch for changes in the given resources.
1005+
1006+ This checks:
1007+
1008+ * that all the specified processes are running,
1009+
1010+ * that all the specified pid files exist,
1011+
1012+ * that the process for which the pid file exists is running, if
1013+ the first line of the pid file is recognised as a pid (i.e. an
1014+ integer),
1015+
1016+ * that at least one log file exists for each given pattern,
1017+
1018+ * that at least one log file for each given pattern has been
1019+ written to recently.
1020+
1021+ """
1022+ try:
1023+ while True:
1024+ # Don't spin round.
1025+ sleep(wait)
1026+ # Check that all pids are receiving signals.
1027+ for pid in pids:
1028+ kill(pid, 0)
1029+ # If any pid file has gone, return.
1030+ for filename in pid_files:
1031+ if not exists(filename):
1032+ return
1033+ # If a pid can be obtained from the file, check it's
1034+ # running.
1035+ pid = read_pid_from_file(filename)
1036+ if pid is not None:
1037+ kill(pid, 0)
1038+ # Check log files.
1039+ expired = time() - expiry_age
1040+ for log_file_pattern in log_file_patterns:
1041+ log_files = glob(log_file_pattern)
1042+ # If no log files exist for this pattern, return.
1043+ if len(log_files) == 0:
1044+ return
1045+ # If none of the log files have been written to
1046+ # recently, return.
1047+ log_file_max_mtime = max(
1048+ getmtime(log_file) for log_file in log_files)
1049+ if log_file_max_mtime < expired:
1050+ return
1051+ except KeyboardInterrupt:
1052+ raise
1053+ except:
1054+ traceback.print_exc()
1055+
1056+
1057+def main(args):
1058+ parser = optparse.OptionParser(
1059+ usage="%prog [options] command", description=(
1060+ "Run a arbitrary command when it appears that a process has "
1061+ "exited or stopped running."))
1062+ parser.add_option(
1063+ '--pid', action='append', dest='pids', type=int, help=(
1064+ "A pid to watch. If the process stops running, "
1065+ "the command is run."))
1066+ parser.add_option(
1067+ '--pid-file', action='append', dest='pid_files', help=(
1068+ "A pid file to watch. If any pid file disappears, "
1069+ "the command is run."))
1070+ parser.add_option(
1071+ '--log-file-pattern', action='append', dest='log_file_patterns',
1072+ help=(
1073+ "A log (or other) file to watch. If any log file disappears, "
1074+ "the command is run. If any log file gets old (see "
1075+ "--expiry-age), the command is run. Values specified here are "
1076+ "subject to glob expansion."))
1077+ parser.add_option(
1078+ '--expiry-age', action='store', dest='expiry_age', help=(
1079+ "The number of seconds since a log file was modified before "
1080+ "it is considered to be expired. Defaults to %default seconds."))
1081+ parser.add_option(
1082+ '--wait', action='store', type=int, dest='wait', help=(
1083+ "How long to wait between each check of the pids and logs. "
1084+ "Defaults to %default seconds."))
1085+ parser.set_defaults(expiry_age=600, wait=10)
1086+
1087+ options, command = parser.parse_args(args)
1088+
1089+ if options.pids is None:
1090+ options.pids = []
1091+ if options.pid_files is None:
1092+ options.pid_files = []
1093+ if options.log_file_patterns is None:
1094+ options.log_file_patterns = []
1095+
1096+ if (len(options.pids) + len(options.pid_files) +
1097+ len(options.log_file_patterns)) == 0:
1098+ parser.error("Nothing to watch.")
1099+ if len(command) == 0:
1100+ parser.error("No command to run.")
1101+ if options.expiry_age < 1:
1102+ parser.error("The expiry age must be at least 1 second.")
1103+ if options.wait < 1:
1104+ parser.error("The wait must be at least 1 second.")
1105+
1106+ watch(
1107+ options.wait, options.pids, options.pid_files,
1108+ options.log_file_patterns, options.expiry_age)
1109+
1110+ return call(command)
1111+
1112+
1113+if __name__ == '__main__':
1114+ try:
1115+ sys.exit(main(sys.argv[1:]))
1116+ except KeyboardInterrupt:
1117+ sys.exit(1)
1118
1119=== modified file 'lib/devscripts/ec2test/testrunner.py'
1120--- lib/devscripts/ec2test/testrunner.py 2009-12-01 22:53:47 +0000
1121+++ lib/devscripts/ec2test/testrunner.py 2009-12-23 15:23:24 +0000
1122@@ -285,23 +285,10 @@
1123 self.email = email
1124
1125 # Email configuration.
1126- if email is not None or pqm_message is not None:
1127- self._smtp_server = config.get_user_option('smtp_server')
1128- if self._smtp_server is None or self._smtp_server == 'localhost':
1129- raise ValueError(
1130- 'To send email, a remotely accessible smtp_server (and '
1131- 'smtp_username and smtp_password, if necessary) must be '
1132- 'configured in bzr. See the SMTP server information '
1133- 'here: https://wiki.canonical.com/EmailSetup .')
1134- self._smtp_username = config.get_user_option('smtp_username')
1135- self._smtp_password = config.get_user_option('smtp_password')
1136- from_email = config.username()
1137- if not from_email:
1138- # XXX: JonathanLange 2009-10-04: Is this strictly true? I
1139- # can't actually see where this is used.
1140- raise ValueError(
1141- 'To send email, your bzr email address must be set '
1142- '(use ``bzr whoami``).')
1143+ self._smtp_server = config.get_user_option('smtp_server')
1144+ self._smtp_username = config.get_user_option('smtp_username')
1145+ self._smtp_password = config.get_user_option('smtp_password')
1146+ self._from_email = config.username()
1147
1148 self._instance = instance
1149
1150@@ -314,29 +301,40 @@
1151
1152 def configure_system(self):
1153 user_connection = self._instance.connect()
1154- as_user = user_connection.perform
1155- # Set up bazaar.conf with smtp information if necessary
1156- if self.email or self.message:
1157- as_user('mkdir .bazaar')
1158- bazaar_conf_file = user_connection.sftp.open(
1159- ".bazaar/bazaar.conf", 'w')
1160- bazaar_conf_file.write(
1161- 'smtp_server = %s\n' % (self._smtp_server,))
1162- if self._smtp_username:
1163- bazaar_conf_file.write(
1164- 'smtp_username = %s\n' % (self._smtp_username,))
1165- if self._smtp_password:
1166- bazaar_conf_file.write(
1167- 'smtp_password = %s\n' % (self._smtp_password,))
1168- bazaar_conf_file.close()
1169- # Copy remote ec2-remote over
1170- self.log('Copying ec2test-remote.py to remote machine.\n')
1171- user_connection.sftp.put(
1172- os.path.join(os.path.dirname(os.path.realpath(__file__)),
1173- 'ec2test-remote.py'),
1174- '/var/launchpad/ec2test-remote.py')
1175- # Set up launchpad login and email
1176- as_user('bzr launchpad-login %s' % (self._launchpad_login,))
1177+ # Set up launchpad login.
1178+ user_connection.perform(
1179+ 'bzr launchpad-login %s' % (self._launchpad_login,))
1180+ # Done.
1181+ user_connection.close()
1182+
1183+ def configure_email(self):
1184+ # Sanity checks.
1185+ if self._smtp_server is None or self._smtp_server == 'localhost':
1186+ raise ValueError(
1187+ 'To send email, a remotely accessible smtp_server (and '
1188+ 'smtp_username and smtp_password, if necessary) must be '
1189+ 'configured in bzr. See the SMTP server information '
1190+ 'here: https://wiki.canonical.com/EmailSetup .')
1191+ if not self._from_email:
1192+ # XXX: JonathanLange 2009-10-04: Is this strictly true? I
1193+ # can't actually see where this is used.
1194+ raise ValueError(
1195+ 'To send email, your bzr email address must be set '
1196+ '(use ``bzr whoami``).')
1197+ # Set up bazaar.conf with smtp information.
1198+ user_connection = self._instance.connect()
1199+ user_connection.perform('mkdir -p .bazaar')
1200+ bazaar_conf_file = user_connection.sftp.open(
1201+ ".bazaar/bazaar.conf", 'w')
1202+ bazaar_conf_file.write(
1203+ 'smtp_server = %s\n' % (self._smtp_server,))
1204+ if self._smtp_username:
1205+ bazaar_conf_file.write(
1206+ 'smtp_username = %s\n' % (self._smtp_username,))
1207+ if self._smtp_password:
1208+ bazaar_conf_file.write(
1209+ 'smtp_password = %s\n' % (self._smtp_password,))
1210+ bazaar_conf_file.close()
1211 user_connection.close()
1212
1213 def prepare_tests(self):
1214@@ -434,9 +432,19 @@
1215
1216 def run_tests(self):
1217 self.configure_system()
1218+ if self.email is not None or self.message is not None:
1219+ self.configure_email()
1220 self.prepare_tests()
1221 user_connection = self._instance.connect()
1222
1223+ # Copy remote ec2-remote over.
1224+ self.log(
1225+ 'Copying remote.py to ec2test-remote.py on remote machine.\n')
1226+ user_connection.sftp.put(
1227+ os.path.join(
1228+ os.path.dirname(os.path.realpath(__file__)), 'remote.py'),
1229+ '/var/launchpad/ec2test-remote.py')
1230+
1231 # Make sure we activate the failsafe --shutdown feature. This will
1232 # make the server shut itself down after the test run completes, or
1233 # if the test harness suffers a critical failure.
1234@@ -523,3 +531,43 @@
1235 # ec2test-remote.py wants the extra options to be after a double-
1236 # dash.
1237 return ('--', self.test_options)
1238+
1239+ def run_node(self):
1240+ """Start the test node."""
1241+ self.configure_system()
1242+ self.configure_email()
1243+ self.prepare_tests()
1244+ user_connection = self._instance.connect()
1245+ # Install Twisted.
1246+ user_connection.perform(
1247+ 'sudo aptitude -y install'
1248+ ' python-twisted-core python-twisted-conch')
1249+ # Get Subunit. This is ugly, but I can't find a better way of
1250+ # doing it right now. Revision 90 is known to work.
1251+ user_connection.run_with_ssh_agent(
1252+ 'bzr branch -r90 lp:subunit ~/subunit')
1253+ # Copy remotenode.py and remotenodekiller.py over.
1254+ self.log('Copying remotenode.py to remote machine.\n')
1255+ from_dir = os.path.dirname(os.path.realpath(__file__))
1256+ user_connection.sftp.put(
1257+ os.path.join(from_dir, 'remotenode.py'),
1258+ '/var/launchpad/remotenode.py')
1259+ user_connection.sftp.put(
1260+ os.path.join(from_dir, 'remotenodekiller.py'),
1261+ '/var/launchpad/remotenodekiller.py')
1262+ # Start the remote node.
1263+ user_connection.perform(
1264+ 'PYTHONPATH=~/subunit/python twistd'
1265+ ' --python /var/launchpad/remotenode.py'
1266+ ' --pidfile /var/launchpad/remotenode.pid'
1267+ ' --logfile /var/www/remotenode.log')
1268+ # Daemonize the killer.
1269+ user_connection.run_as_daemon(
1270+ "python /var/launchpad/remotenodekiller.py"
1271+ " --pid-file /var/launchpad/remotenode.pid"
1272+ " --log-file-pattern '/var/www/*.log*'"
1273+ " -- sudo shutdown -h now")
1274+ # Remove the index.html from the www directory so we can see
1275+ # the logs being written therein.
1276+ user_connection.perform('rm /var/www/index.html')
1277+ user_connection.close()
1278
1279=== added file 'utilities/pb-shell'
1280--- utilities/pb-shell 1970-01-01 00:00:00 +0000
1281+++ utilities/pb-shell 2009-12-23 15:23:24 +0000
1282@@ -0,0 +1,89 @@
1283+#!/usr/bin/python
1284+"""A interactive PB (Perspective Broker) shell."""
1285+
1286+__metatype__ = type
1287+
1288+import code
1289+
1290+from functools import partial
1291+
1292+from twisted.internet import reactor, threads
1293+from twisted.spread import pb
1294+
1295+
1296+def block(func, *args, **kwargs):
1297+ """Run a function in the reactor and wait for the result.
1298+
1299+ If the result is a `pb.RemoteReference`, wrap it with a
1300+ `BlockingRemoteReference`.
1301+ """
1302+ result = threads.blockingCallFromThread(
1303+ reactor, func, *args, **kwargs)
1304+ if isinstance(result, pb.RemoteReference):
1305+ result = BlockingRemoteReference(result)
1306+ return result
1307+
1308+
1309+def blocking(wrapped):
1310+ """A decorator to turn an async PB call into a blocking one."""
1311+ return partial(block, wrapped)
1312+
1313+
1314+class BlockingRemoteReference:
1315+ """A blocking version of a `pb.RemoteReference`.
1316+
1317+ Accessing any attribute (excluding those starting with an
1318+ underscore) returns a blocking wrapper around a `callRemote` for
1319+ the given name.
1320+ """
1321+
1322+ def __init__(self, reference):
1323+ self.reference = reference
1324+
1325+ def __getattr__(self, name):
1326+ if name.startswith('_'):
1327+ raise AttributeError(name)
1328+ return blocking(
1329+ partial(self.reference.callRemote, name))
1330+
1331+
1332+def disaster(failure):
1333+ failure.printTraceback()
1334+ return failure
1335+
1336+
1337+def connect(host, port=8789):
1338+ """Connect to a PB server, returning the root object."""
1339+ # This function blocks, and is meant to be called from *outside*
1340+ # of the event loop.
1341+ def _connect():
1342+ """Connect to a PB server, returning the root object."""
1343+ factory = pb.PBClientFactory()
1344+ reactor.connectTCP(host, port, factory)
1345+ d = factory.getRootObject()
1346+ d.addErrback(disaster)
1347+ return d
1348+ return block(_connect)
1349+
1350+
1351+def interact(ns):
1352+ try:
1353+ from IPython.ipapi import launch_new_instance
1354+ except ImportError:
1355+ return code.interact(banner='', local=ns)
1356+ else:
1357+ return launch_new_instance(ns)
1358+
1359+
1360+def console():
1361+ """Start the interactive shell."""
1362+ commands = {'connect': connect}
1363+ try:
1364+ interact(commands)
1365+ finally:
1366+ reactor.callFromThread(reactor.stop)
1367+
1368+
1369+if __name__ == '__main__':
1370+ reactor.callInThread(console)
1371+ reactor.run()