Merge lp:~vila/uci-engine/ideas into lp:uci-engine

Proposed by Vincent Ladeuil
Status: Work in progress
Proposed branch: lp:~vila/uci-engine/ideas
Merge into: lp:uci-engine
Diff against target: 193 lines (+158/-1)
4 files modified
.bzrignore (+4/-0)
docs/Makefile (+8/-1)
docs/architecture.rst (+134/-0)
docs/images/ticket-worker.dot (+12/-0)
To merge this branch: bzr merge lp:~vila/uci-engine/ideas
Reviewer: Canonical CI Engineering, Status: Pending
Review via email: mp+213601@code.launchpad.net

Description of the change

Throwing ideas to fuel discussion, not proposing to merge.

Best read with:

$ (cd docs ; make html)
$ firefox docs/_build/html/architecture.html

so you get the pretty picture.

The ticket worker is intended to implement various workflows made up of isolated tasks. It could replace the lander's use of Jenkins and provide a more agile architecture.

The main targeted change is to define the API via the messages exchanged between the workers.
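
For illustration only (field names are invented, nothing here is settled), an incoming task message could look something like:

    # Hypothetical sketch of a task message; the schema itself is what we
    # would actually be designing, these fields are just examples.
    message = {
        'ticket_id': 42,           # the ticket this task belongs to
        'task_id': 'image-build',  # which task in the workflow
        'attempt': 1,              # incremented each time the task is retried
        'inputs': ['swift://tickets/42/source.changes'],
    }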

Revision history for this message
Francis Ginther (fginther) wrote :

This is nice and straightforward. Well thought out and described (not half-baked like mine have been :-) ).

To perform different workflows, can I assume that the ticket worker is created with that knowledge as an input from the ticket system? For example, ticket 9 needs to perform task A/B/C and ticket 10 needs to do just A/B. Also, does this convey what tasks can be retried and which ones fail the ticket?

You mention "the task send an outgoing message listing the outputs to another queue." What (or who's) queue is this sent to? Is it a queue owned by this ticket worker? Or is the imagebuilder send a message to the test-runner's queue? Or...

I've also been thinking about the possibility of duplicate tasks. I think with a lot of what we're doing, there's the possibility of a worker or task running but it's unable to communicate and so just merrily proceeds. Meanwhile, the owner of the worker declares it dead and starts up a new one to replace it. I'm not confident we can completely eradicate duplicates so am considering how to live with these in the back of my mind.

Revision history for this message
Vincent Ladeuil (vila) wrote :

Thanks for the thoughtful review; I have some answers below and will
incorporate them in the proposal asap.

> To perform different workflows, can I assume that the ticket worker is created
> with that knowledge as an input from the ticket system? For example, ticket 9
> needs to perform task A/B/C and ticket 10 needs to do just A/B. Also, does
> this convey what tasks can be retried and which ones fail the ticket?

A ticket is associated with a single workflow. The workflow defines which
tasks should be done, their order and how/if they are retried. I first
thought that only the failing task could be retried, but I think we may want
to allow the case where a previous task is retried instead. For example, the
test run fails but we retry the image building. I think we still want to
keep the workflow as an ordered list though.
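
Roughly, and purely as a sketch (names are invented, nothing is implemented),
such a workflow could be an ordered list of task descriptions carrying their
retry policy:

    # Sketch only: each entry says whether the task can be retried and,
    # if so, from which earlier task the ticket should restart.
    workflow = [
        {'task': 'image-build', 'retries': 2},
        {'task': 'test-run', 'retries': 1, 'retry_from': 'image-build'},
        {'task': 'publish', 'retries': 0},  # a failure here fails the ticket
    ]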

> You mention "the task sends an outgoing message listing the outputs to
> another queue." What (or whose) queue is this sent to?

Right, each worker has two queues:
- the incoming queue is shared across workers,
- the output queue is unique.

Just like we do today (and just like Evan did).

Or maybe we don't need to make the output queue unique? I.e. the ticket id
can be either part of the queue name or part of the message.

> Is it a queue owned by this ticket worker?

If it's unique, yes. Or rather, it's a queue between the ticket worker and
the task worker.

> Or does the imagebuilder send a message to the test-runner's queue?

No, task workers communicate via the ticket worker, never directly. That's
exactly the coupling we want to avoid.

> Or...

Or we define a single queue between the classes of workers instead of having
them specific to a ticket. Now that you've asked... I think this may be
simpler as it would significantly reduce the number of queues, making the
controller's work easier.
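
To make that concrete (queue names are invented, and I'm using a pika-style
publish call only for the sake of the example), routing could be one
well-known incoming queue per worker class, with the ticket id carried in
the message rather than in the queue name:

    import json

    # Sketch only: one shared incoming queue per worker class, one shared
    # results queue for the ticket workers.
    INCOMING = {
        'imagebuilder': 'imagebuilder.requests',
        'test-runner': 'testrunner.requests',
    }
    RESULTS = 'ticketworker.results'

    def send_task(channel, worker_class, message):
        # The message carries ticket_id/task_id so the ticket worker can
        # match the reply coming back on the shared results queue.
        channel.basic_publish(exchange='',
                              routing_key=INCOMING[worker_class],
                              body=json.dumps(message))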

>
> I've also been thinking about the possibility of duplicate tasks. I think
> with a lot of what we're doing, there's the possibility of a worker or task
> running but it's unable to communicate and so just merrily proceeds.

/me nods

> Meanwhile, the owner of the worker declares it dead and starts up a new
> one to replace it. I'm not confident we can completely eradicate
> duplicates so am considering how to live with these in the back of my
> mind.

We cannot completely avoid duplicate tasks at the system level so it's
possible (after some network issue or worker death) that the same task is
executed twice in parallel or that some artifacts have already been created
(though we may want to put constraints on the task worker to upload
artifacts when the job is done to reduce potential issues).

In that case, the controller is responsible for ignoring the duplicate when
it is detected.

I was thinking that the ticket worker would create a message with (ticket
id, task id, task number), with task number being incremented to make it
unique for a ticket. This gives a way to represent the multiple attempts.
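
Purely as an illustration (the format is not settled), the identifier could
be built like:

    # Sketch only: an identifier unique per attempt of a task within a
    # ticket, so duplicates can be detected and ignored downstream.
    def task_attempt_id(ticket_id, task_id, attempt):
        return '%s-%s-%d' % (ticket_id, task_id, attempt)

    # e.g. task_attempt_id(42, 'image-build', 2) == '42-image-build-2'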

But now that you mention duplicate tasks, I realized this may not be enough
to create unique identifiers for the data store. So, the task worker should
create its artifacts with its node id (in addition to the above) and the
ticket worker will h...


Unmerged revisions

419. By Vincent Ladeuil

Add a state automaton diagram for the ticket worker.

418. By Vincent Ladeuil

Pointers to rabbit for the never failing cluster configuration.

417. By Vincent Ladeuil

Brain dump for phase-1.

Preview Diff

1=== modified file '.bzrignore'
2--- .bzrignore 2014-03-10 22:25:00 +0000
3+++ .bzrignore 2014-04-01 06:59:36 +0000
4@@ -1,6 +1,10 @@
5 # For all dependencies running from source
6 ./branches/*
7+# Where sphinx puts all its produced files
8 ./docs/_build
9+# We ignore the new .png files created by dot, so they will have to be bzr
10+# added explicitly.
11+./docs/images/*.png
12 .deps
13 *.egg-info
14 *.pyc
15
16=== modified file 'docs/Makefile'
17--- docs/Makefile 2013-11-16 10:12:08 +0000
18+++ docs/Makefile 2014-04-01 06:59:36 +0000
19@@ -38,10 +38,17 @@
20 @echo " linkcheck to check all external links for integrity"
21 @echo " doctest to run all doctests embedded in the documentation (if enabled)"
22
23+SCHEMAS=$(wildcard images/*.dot)
24+PNGS=${SCHEMAS:images/%.dot=images/%.png}
25+
26+%.png : %.dot
27+ dot -Tpng $< -o$@
28+
29 clean:
30 -rm -rf $(BUILDDIR)/*
31
32-html:
33+html:
34+html: ${PNGS}
35 $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
36 @echo
37 @echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
38
39=== added file 'docs/architecture.rst'
40--- docs/architecture.rst 1970-01-01 00:00:00 +0000
41+++ docs/architecture.rst 2014-04-01 06:59:36 +0000
42@@ -0,0 +1,134 @@
43+======
44+Engine
45+======
46+
47+The engine accepts tickets that follow a task workflow. Each task is atomic
48+and succeeds or fails. Some tasks can be retried, allowing the ticket to
49+complete successfully. Other tasks can't be retried, in which case the ticket
50+fails.
51+
52+The engine outputs include:
53+- binary packages,
54+- images,
55+- test failures,
56+- logs and metrics associated with any of the above.
57+
58+========
59+Workflow
60+========
61+
62+A workflow describes a list of tasks that should succeed for a ticket to
63+succeed. The ticket state represents the place where a ticket is in the
64+workflow at a given point in time.
65+
66+A single ticket worker is responsible for a given ticket; it schedules and
67+monitors tasks according to the ticket workflow.
68+
69+From the ticket worker, a task can:
70+
71+- succeed and change the place of the ticket,
72+
73+- fail and not change the place of the ticket; if needed, a new task is
74+ created,
75+
76+- hang and therefore not change the place of the ticket. The ticket worker
77+ will kill the task when a timeout is reached; a killed task fails.
78+
79+The following state automaton captures the above definition:
80+
81+.. image:: images/ticket-worker.png
82+
83+
84+
85+The ticket worker owns the ticket and its state. A ticket state changes
86+under the worker's responsibility in an atomic (and persistent) way when a
87+task completes (success or failure).
88+
89+
90+If a ticket worker dies, another ticket worker will take ownership of the
91+ticket and acquire the ticket state from the persistent storage.
92+
93+====
94+Task
95+====
96+
97+A task:
98+
99+- has a task id including the ticket-id,
100+
101+- a task receives an incoming message and produces an outgoing message,
102+
103+- acquires an incoming message uniquely defining a task, including the task
104+ id,
105+
106+- the task sets up its environment from the message content only. If this
107+ fails the incoming message is nacked,
108+
109+- the task does its core job (build a package, an image, run tests). If this
110+ fails the incoming message is nacked,
111+
112+- the task uploads its outputs (uniquely identified with the task id) to
113+ swift. If this fails, the incoming message is nacked,
114+
115+- the task sends an outgoing message listing the outputs to another queue. If
116+ that fails the incoming message is nacked,
117+
118+- the task tears down its environment. We don't care if that fails. If that
119+ leads to a worker dying, another worker will step up.
120+
121+
122+========
123+RabbitMQ
124+========
125+
126+Rabbit provides support for a store-and-forward message protocol.
127+
128+This guarantees that no messages are lost once they enter the queues
129+("store"). They also guarantees that a message is not stuck in queues as
130+long as consumers exist or appear after a reasonable time ("forward").
131+
132+We have two use cases:
133+
134+- a single server that never fails,
135+
136+- a cluster of servers that never fails
137+ (`High availability <http://www.rabbitmq.com/ha.html>`_, one AZ, no
138+ `net partitions <http://www.rabbitmq.com/partitions.html>`_).
139+
140+The first one is what we have for phase-0 and is enough for most of our
141+tests. We'll need some specific tests to cover the scenarios we care about
142+in the cluster case.
143+
144+The outcome is that we can rely on the following properties:
145+
146+- a message that entered a queue will never be lost,
147+
148+- a message that left a queue will never be lost.
149+
150+The latter case has two applications:
151+
152+- an output message guarantees that a task is done; if that fails, the message
153+ stays in the queue,
154+
155+- a worker acquiring an output message will always produce an input message
156+ in another queue. If that fails the output message stays in the
157+ queue. There is a caveat here as the input message in the other queue
158+ won't be deleted if the output message cannot be acked.
159+
160+In summary, while we have the guarantee that a message will never be lost, we
161+may encounter cases where duplicate messages appear in the system.
162+
163+To address the duplicate messages we need a way to identify their intent
164+uniquely. In our case, this is the ticket id and the task id.
165+
166+At the workflow level, for a given ticket, we can identify and ignore
167+duplicate messages.
168+
169+=====
170+Swift
171+=====
172+
173+Tasks produce artifacts, logs and results that are stored securely in swift.
174+
175+If a task fails to upload an object, it fails and nacks its incoming message.
176+
177
178=== added file 'docs/images/ticket-worker.dot'
179--- docs/images/ticket-worker.dot 1970-01-01 00:00:00 +0000
180+++ docs/images/ticket-worker.dot 2014-04-01 06:59:36 +0000
181@@ -0,0 +1,12 @@
182+digraph "ticket worker state automaton" {
183+ "created" [peripheries=2]
184+ "done" [peripheries=2]
185+ "created" -> "started" [label="setup"]
186+ "started" -> "waiting on task" [label="do_task"]
187+ "waiting on task" -> "task succeeded" [label="success"]
188+ "task succeeded" -> "done" [label="tears down"]
189+ "waiting on task" -> "task failed" [label="fails"]
190+ "waiting on task" -> "task failed" [label="task_times_out"]
191+ "task failed" -> "waiting on task" [label="do_task"]
192+ "task failed" -> "done" [label="tears_down"]
193+}
