Merge lp:~vila/uci-engine/ideas into lp:uci-engine

Proposed by Vincent Ladeuil
Status: Work in progress
Proposed branch: lp:~vila/uci-engine/ideas
Merge into: lp:uci-engine
Diff against target: 193 lines (+158/-1)
4 files modified
.bzrignore (+4/-0)
docs/Makefile (+8/-1)
docs/architecture.rst (+134/-0)
docs/images/ticket-worker.dot (+12/-0)
To merge this branch: bzr merge lp:~vila/uci-engine/ideas
Reviewer: Canonical CI Engineering, Status: Pending
Review via email: mp+213601@code.launchpad.net

Description of the change

Throwing ideas to fuel discussion, not proposing to merge.

Best read with:

$ (cd docs ; make html)
$ firefox docs/_build/html/architecture.html

so you get the pretty picture.

The ticket worker is intended to implement various workflows made up of isolated tasks. It could replace the lander's use of Jenkins and provide a more agile architecture.

The main targeted change is to define the API via the messages exchanged between the workers.
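
For illustration only (field names are invented, nothing here is settled), an incoming task message could look something like:

    # Hypothetical sketch of a task message; the schema itself is what we
    # would actually be designing, these fields are just examples.
    message = {
        'ticket_id': 42,           # the ticket this task belongs to
        'task_id': 'image-build',  # which task in the workflow
        'attempt': 1,              # incremented each time the task is retried
        'inputs': ['swift://tickets/42/source.changes'],
    }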

Revision history for this message
Francis Ginther (fginther) wrote :

This is nice and straightforward. Well thought out and described (not half-baked like mine have been :-) ).

To perform different workflows, can I assume that the ticket worker is created with that knowledge as an input from the ticket system? For example, ticket 9 needs to perform task A/B/C and ticket 10 needs to do just A/B. Also, does this convey what tasks can be retried and which ones fail the ticket?

You mention "the task send an outgoing message listing the outputs to another queue." What (or who's) queue is this sent to? Is it a queue owned by this ticket worker? Or is the imagebuilder send a message to the test-runner's queue? Or...

I've also been thinking about the possibility of duplicate tasks. I think with a lot of what we're doing, there's the possibility of a worker or task running but it's unable to communicate and so just merrily proceeds. Meanwhile, the owner of the worker declares it dead and starts up a new one to replace it. I'm not confident we can completely eradicate duplicates so am considering how to live with these in the back of my mind.

Revision history for this message
Vincent Ladeuil (vila) wrote :

Thanks for the thoughtful review; I have some answers below and will
incorporate them in the proposal asap.

> To perform different workflows, can I assume that the ticket worker is created
> with that knowledge as an input from the ticket system? For example, ticket 9
> needs to perform task A/B/C and ticket 10 needs to do just A/B. Also, does
> this convey what tasks can be retried and which ones fail the ticket?

A ticket is associated with a single workflow. The workflow defines which
tasks should be done, their order and how/if they are retried. I first
thought that only the failing task could be retried, but I think we may want
to allow the case where a previous task is retried instead. For example, the
test run fails but we retry the image building. I think we still want to
keep the workflow as an ordered list though.
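
Roughly, and purely as a sketch (names are invented, nothing is implemented),
such a workflow could be an ordered list of task descriptions carrying their
retry policy:

    # Sketch only: each entry says whether the task can be retried and,
    # if so, from which earlier task the ticket should restart.
    workflow = [
        {'task': 'image-build', 'retries': 2},
        {'task': 'test-run', 'retries': 1, 'retry_from': 'image-build'},
        {'task': 'publish', 'retries': 0},  # a failure here fails the ticket
    ]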

> You mention "the task sends an outgoing message listing the outputs to
> another queue." What (or whose) queue is this sent to?

Right, each worker has two queues:
- the incoming queue is shared across workers,
- the output queue is unique.

Just like we do today (and just like Evan did).

Or maybe we don't need to make the output queue unique? I.e. the ticket id
can be either part of the queue name or part of the message.

> Is it a queue owned by this ticket worker?

If it's unique, yes. Or rather, it's a queue between the ticket worker and
the task worker.

> Or does the imagebuilder send a message to the test-runner's queue?

No, task workers communicate via the ticket worker, never directly. That's
exactly the coupling we want to avoid.

> Or...

Or we define a single queue between the classes of workers instead of having
them specific to a ticket. Now that you've asked... I think this may be
simpler as it would significantly reduce the number of queues, making the
controller's work easier.
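
To make that concrete (queue names are invented, and I'm using a pika-style
publish call only for the sake of the example), routing could be one
well-known incoming queue per worker class, with the ticket id carried in
the message rather than in the queue name:

    import json

    # Sketch only: one shared incoming queue per worker class, one shared
    # results queue for the ticket workers.
    INCOMING = {
        'imagebuilder': 'imagebuilder.requests',
        'test-runner': 'testrunner.requests',
    }
    RESULTS = 'ticketworker.results'

    def send_task(channel, worker_class, message):
        # The message carries ticket_id/task_id so the ticket worker can
        # match the reply coming back on the shared results queue.
        channel.basic_publish(exchange='',
                              routing_key=INCOMING[worker_class],
                              body=json.dumps(message))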

>
> I've also been thinking about the possibility of duplicate tasks. I think
> with a lot of what we're doing, there's the possibility of a worker or task
> running but it's unable to communicate and so just merrily proceeds.

/me nods

> Meanwhile, the owner of the worker declares it dead and starts up a new
> one to replace it. I'm not confident we can completely eradicate
> duplicates so am considering how to live with these in the back of my
> mind.

We cannot completely avoid duplicate tasks at the system level so it's
possible (after some network issue or worker death) that the same task is
executed twice in parallel or that some artifacts have already been created
(though we may want to put constraints on the task worker to upload
artifacts when the job is done to reduce potential issues).

In that case, the controller is responsible for ignoring the duplicate when
it is detected.

I was thinking that the ticket worker would create a message with (ticket
id, task id, task number), with task number being incremented to make it
unique for a ticket. This gives a way to represent the multiple attempts.
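
Purely as an illustration (the format is not settled), the identifier could
be built like:

    # Sketch only: an identifier unique per attempt of a task within a
    # ticket, so duplicates can be detected and ignored downstream.
    def task_attempt_id(ticket_id, task_id, attempt):
        return '%s-%s-%d' % (ticket_id, task_id, attempt)

    # e.g. task_attempt_id(42, 'image-build', 2) == '42-image-build-2'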

But now that you mention duplicate tasks, I realized this may not be enough
to create unique identifiers for the data store. So, the task worker should
create its artifacts with its node id (in addition to the above) and the
ticket worker will h...


Unmerged revisions

419. By Vincent Ladeuil

Add a state automaton diagram for the ticket worker.

418. By Vincent Ladeuil

Pointers to rabbit for the never failing cluster configuration.

417. By Vincent Ladeuil

Brain dump for phase-1.

Preview Diff

1=== modified file '.bzrignore'
2--- .bzrignore 2014-03-10 22:25:00 +0000
3+++ .bzrignore 2014-04-01 06:59:36 +0000
4@@ -1,6 +1,10 @@
5 # For all dependencies running from source
6 ./branches/*
7+# Where sphinx puts all its produced files
8 ./docs/_build
9+# We ignore the new .png files created by dot, so they will have to be bzr
10+# added explicitly.
11+./docs/images/*.png
12 .deps
13 *.egg-info
14 *.pyc
15
16=== modified file 'docs/Makefile'
17--- docs/Makefile 2013-11-16 10:12:08 +0000
18+++ docs/Makefile 2014-04-01 06:59:36 +0000
19@@ -38,10 +38,17 @@
20 @echo " linkcheck to check all external links for integrity"
21 @echo " doctest to run all doctests embedded in the documentation (if enabled)"
22
23+SCHEMAS=$(wildcard images/*.dot)
24+PNGS=${SCHEMAS:images/%.dot=images/%.png}
25+
26+%.png : %.dot
27+ dot -Tpng $< -o$@
28+
29 clean:
30 -rm -rf $(BUILDDIR)/*
31
32-html:
33+html:
34+html: ${PNGS}
35 $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
36 @echo
37 @echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
38
39=== added file 'docs/architecture.rst'
40--- docs/architecture.rst 1970-01-01 00:00:00 +0000
41+++ docs/architecture.rst 2014-04-01 06:59:36 +0000
42@@ -0,0 +1,134 @@
43+======
44+Engine
45+======
46+
47+The engine accepts tickets that follow a task workflow. Each task is atomic
48+and succeeds or fails. Some tasks can be retried, allowing the ticket to
49+complete successfully. Other tasks can't be retried, in which case the ticket
50+fails.
51+
52+The engine outputs include:
53+- binary packages,
54+- images,
55+- test failures,
56+- logs and metrics associated with any of the above.
57+
58+========
59+Workflow
60+========
61+
62+A workflow describes a list of tasks that should succeed for a ticket to
63+succeed. The ticket state represents the place where a ticket is in the
64+workflow at a given point in time.
65+
66+A single ticket worker is responsible for a given ticket; it schedules and
67+monitors tasks according to the ticket workflow.
68+
69+From the ticket worker, a task can:
70+
71+- succeed and change the place of the ticket,
72+
73+- fail and not change the place of the ticket; if needed, a new task is
74+ created,
75+
76+- hang and therefore not change the place of the ticket. The ticket worker
77+ will kill the task when a timeout is reached; a killed task fails.
78+
79+The following state automaton captures the above definition:
80+
81+.. image:: images/ticket-worker.png
82+
83+
84+
85+The ticket worker owns the ticket and its state. A ticket state changes
86+under the worker's responsibility in an atomic (and persistent) way when a
87+task completes (success or failure).
88+
89+
90+If a ticket worker dies, another ticket worker will take ownership of the
91+ticket and acquire the ticket state from the persistent storage.
92+
93+====
94+Task
95+====
96+
97+A task:
98+
99+- has a task id including the ticket-id,
100+
101+- a task receives an incoming message and produces an outgoing message,
102+
103+- acquires an incoming message uniquely defining a task, including the task
104+ id,
105+
106+- the task sets up its environment from the message content only. If this
107+ fails the incoming message is nacked,
108+
109+- the task does its core job (build a package, an image, run tests). If this
110+ fails the incoming message is nacked,
111+
112+- the task uploads its outputs (uniquely identified with the task id) to
113+ swift. If this fails, the incoming message is nacked,
114+
115+- the task sends an outgoing message listing the outputs to another queue. If
116+ that fails the incoming message is nacked,
117+
118+- the task tears down its environment. We don't care if that fails. If that
119+ leads to a worker dying, another worker will step up.
120+
121+
122+========
123+RabbitMQ
124+========
125+
126+Rabbit provides support for a store-and-forward message protocol.
127+
128+This guarantees that no messages are lost once they enter the queues
129+("store"). They also guarantees that a message is not stuck in queues as
130+long as consumers exist or appear after a reasonable time ("forward").
131+
132+We have two use cases:
133+
134+- a single server that never fails,
135+
136+- a cluster of servers that never fails
137+ (`High availability <http://www.rabbitmq.com/ha.html>`_, one AZ, no
138+ `net partitions <http://www.rabbitmq.com/partitions.html>`_).
139+
140+The first one is what we have for phase-0 and is enough for most of our
141+tests. We'll need some specific tests to cover the scenarios we care about
142+in the cluster case.
143+
144+The outcome is that we can rely on the following properties:
145+
146+- a message that entered a queue will never be lost,
147+
148+- a message that left a queue will never be lost.
149+
150+The latter case has two applications:
151+
152+- an output message guarantees that a task is done; if that fails, the message
153+ stays in the queue,
154+
155+- a worker acquiring an output message will always produce an input message
156+ in another queue. If that fails the output message stays in the
157+ queue. There is a caveat here as the input message in the other queue
158+ won't be deleted if the output message cannot be acked.
159+
160+In summary, while we have the guarantee that a message will never be lost, we
161+may encounter cases where duplicate messages appear in the system.
162+
163+To address the duplicate messages we need a way to identify their intent
164+uniquely. In our case, this is the ticket id and the task id.
165+
166+At the workflow level, for a given ticket, we can identify and ignore
167+duplicate messages.
168+
169+=====
170+Swift
171+=====
172+
173+Tasks produce artifacts, logs and results that are stored securely in swift.
174+
175+If a task fails to upload an object, it fails and nacks its incoming message.
176+
177
178=== added file 'docs/images/ticket-worker.dot'
179--- docs/images/ticket-worker.dot 1970-01-01 00:00:00 +0000
180+++ docs/images/ticket-worker.dot 2014-04-01 06:59:36 +0000
181@@ -0,0 +1,12 @@
182+digraph "ticket worker state automaton" {
183+ "created" [peripheries=2]
184+ "done" [peripheries=2]
185+ "created" -> "started" [label="setup"]
186+ "started" -> "waiting on task" [label="do_task"]
187+ "waiting on task" -> "task succeeded" [label="success"]
188+ "task succeeded" -> "done" [label="tears down"]
189+ "waiting on task" -> "task failed" [label="fails"]
190+ "waiting on task" -> "task failed" [label="task_times_out"]
191+ "task failed" -> "waiting on task" [label="do_task"]
192+ "task failed" -> "done" [label="tears_down"]
193+}
