Merge lp:~vila/uci-engine/ideas into lp:uci-engine

Proposed by Vincent Ladeuil
Status: Work in progress
Proposed branch: lp:~vila/uci-engine/ideas
Merge into: lp:uci-engine
Diff against target: 193 lines (+158/-1)
4 files modified
.bzrignore (+4/-0)
docs/Makefile (+8/-1)
docs/architecture.rst (+134/-0)
docs/images/ticket-worker.dot (+12/-0)
To merge this branch: bzr merge lp:~vila/uci-engine/ideas
Reviewer: Canonical CI Engineering (status: Pending)
Review via email: mp+213601@code.launchpad.net

Description of the change

Throwing ideas to fuel discussion, not proposing to merge.

Best read with:

$ (cd docs ; make html)
$ firefox docs/_build/html/architecture.html

so you get the pretty picture.

The ticket worker is intended to implement various workflows defining isolated tasks. It could replace the lander's use of Jenkins and provide a more agile architecture.

The main targeted change is to define the API via the messages exchanged between the workers.
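As a sketch of what a message-defined API could look like, here is a hypothetical task message. The field names are illustrative assumptions only, not taken from the branch:

```python
# Hypothetical shape for the messages exchanged between workers.
# Field names (ticket_id, task_id, task_number, inputs) are illustrative;
# the branch does not fix a schema yet.
import json

def make_task_message(ticket_id, task_id, task_number, inputs):
    """Build the incoming message that fully defines one task."""
    return json.dumps({
        "ticket_id": ticket_id,      # which ticket this task belongs to
        "task_id": task_id,          # which task in the workflow
        "task_number": task_number,  # attempt counter, unique per ticket
        "inputs": inputs,            # everything the task needs to set up
    })

msg = make_task_message(9, "build_image", 1, {"series": "trusty"})
print(json.loads(msg)["task_id"])  # -> build_image
```

Defining the API this way means a worker can be replaced by anything that consumes and produces the same messages.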

Revision history for this message
Francis Ginther (fginther) wrote :

This is nice and straightforward. Well thought out and described (not half-baked like mine have been :-) ).

To perform different workflows, can I assume that the ticket worker is created with that knowledge as an input from the ticket system? For example, ticket 9 needs to perform task A/B/C and ticket 10 needs to do just A/B. Also, does this convey what tasks can be retried and which ones fail the ticket?

You mention "the task send an outgoing message listing the outputs to another queue." What (or whose) queue is this sent to? Is it a queue owned by this ticket worker? Or does the imagebuilder send a message to the test-runner's queue? Or...

I've also been thinking about the possibility of duplicate tasks. I think with a lot of what we're doing, there's the possibility of a worker or task that is running but unable to communicate, and so just merrily proceeds. Meanwhile, the owner of the worker declares it dead and starts up a new one to replace it. I'm not confident we can completely eradicate duplicates, so I am considering how to live with these in the back of my mind.

Revision history for this message
Vincent Ladeuil (vila) wrote :

Thanks for the thoughtful review, I have some answers below and will
incorporate them in the proposal asap.

> To perform different workflows, can I assume that the ticket worker is created
> with that knowledge as an input from the ticket system? For example, ticket 9
> needs to perform task A/B/C and ticket 10 needs to do just A/B. Also, does
> this convey what tasks can be retried and which ones fail the ticket?

A ticket is associated with a single workflow. The workflow defines which
tasks should be done, their order and how/if they are retried. I first
thought that a task could only be retried, but I think we may want to allow
the case where a previous task is retried instead. For example, the test run
fails but we re-try the image building. I think we still want to keep the
workflow as an ordered list though.
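The ordered-list idea could be sketched like this (the task names and the retry-target column are purely illustrative assumptions, not part of the proposal):

```python
# Sketch of a workflow as an ordered list of tasks.  Assumption: each entry
# carries the task name, a retry budget, and which task to re-run on failure
# -- possibly an *earlier* one, as in the "retry the image build when the
# test run fails" example.  None of these names come from the branch.
WORKFLOW = [
    # (task name, max retries, task to re-run on failure)
    ("build_package", 2, "build_package"),
    ("build_image",   1, "build_image"),
    ("run_tests",     1, "build_image"),  # retry an earlier task
]

def next_task_on_failure(failed_task):
    """Return the task to (re)schedule when failed_task fails, or None."""
    for name, retries, retry_from in WORKFLOW:
        if name == failed_task:
            return retry_from if retries > 0 else None
    return None

print(next_task_on_failure("run_tests"))  # -> build_image
```

Keeping the workflow as a flat ordered list keeps the ticket worker simple while still allowing "go back and redo an earlier task" retries.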

> You mention "the task send an outgoing message listing the outputs to
> another queue." What (or whose) queue is this sent to?

Right, each worker has two queues:
- the incoming queue is shared across workers,
- the output queue is unique.

Just like we do today (and just like Evan did).

Or maybe we don't need to make the output queue unique? I.e. the ticket id
can be either part of the queue name or part of the message.

> Is it a queue owned by this ticket worker?

If it's unique, yes. Or rather, it's a queue between the ticket worker and
the task worker.

> Or does the imagebuilder send a message to the test-runner's queue?

No, task workers communicate via the ticket worker, never directly. That's
exactly the coupling we want to avoid.

> Or...

Or we define a single queue between the classes of workers instead of having
them specific to a ticket. Now that you've asked... I think this may be
simpler as it would significantly reduce the number of queues, making the
controller work easier.
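A minimal sketch of that naming scheme, assuming one shared queue per worker class and the ticket id travelling in the message body (queue names are illustrative):

```python
# Sketch of queue naming if queues are shared per worker class rather than
# being specific to a ticket.  The ".in"/".out" suffixes are an assumption,
# not something the branch defines.
def incoming_queue(worker_class):
    # One shared queue per class of task worker, e.g. "imagebuilder.in".
    return "%s.in" % worker_class

def output_queue(worker_class):
    # Results go back to the ticket workers; the ticket id travels in the
    # message body instead of in the queue name.
    return "%s.out" % worker_class

print(incoming_queue("imagebuilder"))  # -> imagebuilder.in
print(output_queue("imagebuilder"))    # -> imagebuilder.out
```

With this scheme the number of queues grows with the number of worker classes, not with the number of in-flight tickets.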

>
> I've also been thinking about the possibility of duplicates tasks. I think
> with a lot of what we're doing, there's the possibility of a worker or task
> running but it's unable to communicate and so just merrily proceeds.

/me nods

> Meanwhile, the owner of the worker declares it dead and starts up a new
> one to replace it. I'm not confident we can completely eradicate
> duplicates so am considering how to live with these in the back of my
> mind.

We cannot completely avoid duplicate tasks at the system level, so it's
possible (after some network issue or worker death) that the same task is
executed twice in parallel or that some artifacts have already been created
(though we may want to constrain the task worker to upload artifacts only
when the job is done, to reduce potential issues).

In that case, the controller is responsible for ignoring the duplicate when
detected.
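A minimal sketch of such duplicate detection, assuming each message carries the (ticket id, task id, task number) triple:

```python
# Minimal sketch of duplicate detection on the controller side, assuming
# each message carries a (ticket id, task id, task number) triple.  The
# class and its API are illustrative, not code from the branch.
class DuplicateFilter:
    def __init__(self):
        self._seen = set()

    def accept(self, ticket_id, task_id, task_number):
        """Return True the first time a triple is seen, False after that."""
        key = (ticket_id, task_id, task_number)
        if key in self._seen:
            return False  # duplicate delivery: ignore it
        self._seen.add(key)
        return True

f = DuplicateFilter()
print(f.accept(9, "build_image", 1))  # first delivery -> True
print(f.accept(9, "build_image", 1))  # replay after a worker death -> False
```

A real controller would persist the seen set so that it survives its own restarts, but the idempotency principle is the same.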

I was thinking that the ticket worker would create a message with (ticket
id, task id, task number), the task number being incremented to make it
unique for a ticket. This gives a way to represent the multiple attempts.

But now that you mention duplicate tasks, I realized this may not be enough
to create unique identifiers for the data store. So, the task worker should
create its artifacts with its node id (in addition to the above) and the
ticket worker will h...


Unmerged revisions

419. By Vincent Ladeuil

Add a state automaton diagram for the ticket worker.

418. By Vincent Ladeuil

Pointers to rabbit for the never failing cluster configuration.

417. By Vincent Ladeuil

Brain dump for phase-1.

Preview Diff

=== modified file '.bzrignore'
--- .bzrignore 2014-03-10 22:25:00 +0000
+++ .bzrignore 2014-04-01 06:59:36 +0000
@@ -1,6 +1,10 @@
 # For all dependencies running from source
 ./branches/*
+# Where sphinx puts all its produced files
 ./docs/_build
+# We ignore the new .png files created for dot, so they will have to be bzr
+# added explicitly.
+./docs/images/*.png
 .deps
 *.egg-info
 *.pyc
 
=== modified file 'docs/Makefile'
--- docs/Makefile 2013-11-16 10:12:08 +0000
+++ docs/Makefile 2014-04-01 06:59:36 +0000
@@ -38,10 +38,17 @@
 	@echo "  linkcheck  to check all external links for integrity"
 	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
 
+SCHEMAS=$(wildcard images/*.dot)
+PNGS=${SCHEMAS:images/%.dot=images/%.png}
+
+%.png : %.dot
+	dot -Tpng $< -o$@
+
 clean:
 	-rm -rf $(BUILDDIR)/*
 
 html:
+html: ${PNGS}
 	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
 	@echo
 	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
 
=== added file 'docs/architecture.rst'
--- docs/architecture.rst 1970-01-01 00:00:00 +0000
+++ docs/architecture.rst 2014-04-01 06:59:36 +0000
@@ -0,0 +1,134 @@
+======
+Engine
+======
+
+The engine accepts tickets that follow a task workflow. Each task is atomic
+and succeeds or fails. Some tasks can be retried, allowing the ticket to
+complete successfully. Other tasks can't be retried, in which case the ticket
+fails.
+
+The engine outputs include:
+- binary packages,
+- images,
+- test failures,
+- logs and metrics associated with any of the above.
+
+========
+Workflow
+========
+
+A workflow describes a list of tasks that should succeed for a ticket to
+succeed. The ticket state represents the place where a ticket is in the
+workflow at a given point in time.
+
+A single ticket worker is responsible for a given ticket; it schedules and
+monitors tasks according to the ticket workflow.
+
+From the ticket worker, a task can:
+
+- succeed and change the place of the ticket,
+
+- fail and not change the place of the ticket; if needed, a new task is
+  created,
+
+- hang and therefore not change the place of the ticket. The ticket worker
+  will kill the task when a timeout is reached; a killed task fails.
+
+The following state automaton captures the above definition:
+
+.. image:: images/ticket-worker.png
+
+
+
+The ticket worker owns the ticket and its state. A ticket state changes
+under the worker's responsibility in an atomic (and persistent) way when a
+task completes (success or failure).
+
+
+If a ticket worker dies, another ticket worker will take ownership of the
+ticket and acquire the ticket state from the persistent storage.
+
+====
+Task
+====
+
+A task:
+
+- has a task id including the ticket id,
+
+- receives an incoming message and produces an outgoing message,
+
+- acquires an incoming message uniquely defining the task, including the task
+  id,
+
+- sets up its environment from the message content only. If this
+  fails, the incoming message is nacked,
+
+- does its core job (build a package, an image, run tests). If this
+  fails, the incoming message is nacked,
+
+- uploads its outputs (uniquely identified with the task id) to
+  swift. If this fails, the incoming message is nacked,
+
+- sends an outgoing message listing the outputs to another queue. If
+  that fails, the incoming message is nacked,
+
+- tears down its environment. We don't care if that fails. If that
+  leads to a worker dying, another worker will step up.
+
+
+========
+RabbitMQ
+========
+
+Rabbit provides support for a message store-and-forward protocol.
+
+This guarantees that no messages are lost once they enter the queues
+("store"). It also guarantees that a message is not stuck in the queues as
+long as consumers exist or appear after a reasonable time ("forward").
+
+We have two use cases:
+
+- a single server that never fails,
+
+- a cluster of servers that never fails
+  (`High availability <http://www.rabbitmq.com/ha.html>`_, one AZ, no
+  `net partitions <http://www.rabbitmq.com/partitions.html>`_).
+
+The first one is what we have for phase-0 and is enough for most of our
+tests. We'll need some specific tests to cover the scenarios we care about
+in the cluster case.
+
+The outcome is that we can rely on the following properties:
+
+- a message that entered a queue will never be lost,
+
+- a message that left a queue will never be lost.
+
+The latter case has two applications:
+
+- an output message guarantees that a task is done; if that fails, the message
+  stays in the queue,
+
+- a worker acquiring an output message will always produce an input message
+  in another queue. If that fails, the output message stays in the
+  queue. There is a caveat here, as the input message in the other queue
+  won't be deleted if the output message cannot be acked.
+
+In summary, while we have the guarantee that a message will never be lost, we
+may encounter cases where duplicate messages appear in the system.
+
+To address duplicate messages, we need a way to identify their intent
+uniquely. In our case, this is the ticket id and the task id.
+
+At the workflow level, for a given ticket, we can identify and ignore
+duplicate messages.
+
+=====
+Swift
+=====
+
+Tasks produce artifacts, logs and results that are stored securely in swift.
+
+If a task fails to upload an object, it fails and nacks its incoming message.
+
=== added file 'docs/images/ticket-worker.dot'
--- docs/images/ticket-worker.dot 1970-01-01 00:00:00 +0000
+++ docs/images/ticket-worker.dot 2014-04-01 06:59:36 +0000
@@ -0,0 +1,12 @@
+digraph "test worker state automaton" {
+    "created" [peripheries=2]
+    "done" [peripheries=2]
+    "created" -> "started" [label="setup"]
+    "started" -> "waiting on task" [label="do_task"]
+    "waiting on task" -> "task succeeded" [label="success"]
+    "task succeeded" -> "done" [label="tears_down"]
+    "waiting on task" -> "task failed" [label="fails"]
+    "waiting on task" -> "task failed" [label="task_times_out"]
+    "task failed" -> "waiting on task" [label="do_task"]
+    "task failed" -> "done" [label="tears_down"]
+}
