Merge lp:~vila/uci-engine/prod-britney into lp:uci-engine

Proposed by Vincent Ladeuil
Status: Work in progress
Proposed branch: lp:~vila/uci-engine/prod-britney
Merge into: lp:uci-engine
Diff against target: 240 lines (+52/-19)
10 files modified
britney_proxy/britney/process_requests.py (+1/-1)
britney_proxy/britney/process_results.py (+22/-5)
ci-utils/ci_utils/amqp_worker.py (+2/-0)
juju-deployer/test-runner.yaml.tmpl (+3/-2)
test_runner/bin/check_worker.py (+4/-3)
test_runner/tstrun/run_test.py (+2/-2)
test_runner/tstrun/run_worker.py (+9/-1)
test_runner/tstrun/testbed.py (+3/-0)
test_runner/tstrun/tests/test_testbed.py (+0/-1)
test_runner/tstrun/tests/test_worker.py (+6/-4)
To merge this branch: bzr merge lp:~vila/uci-engine/prod-britney
Reviewer: Canonical CI Engineering
Review status: Pending
Review via email: mp+240360@code.launchpad.net

Commit message

Running on uci-engine during validation

Description of the change

Work in progress carrying the currently tested delta on uci-britney@bootstack.

Mostly to get a better picture of what still needs to land on lp:uci-engine.

lp:~vila/uci-engine/prod-britney updated
914. By Vincent Ladeuil

The ~300 failures to set up the testbed were caused by using an illegal hostname inherited from the nova instance name.
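
One way to avoid this class of failure is to derive the hostname from the instance name instead of reusing it verbatim. The sketch below is illustrative only: to_hostname_label is a hypothetical helper, not part of uci-engine, and it assumes the usual RFC 1123 rules for a single hostname label (letters, digits and hyphens, at most 63 characters, no leading or trailing hyphen). The branch itself takes a simpler route and switches to a fixed 'uci-testbed-<series>-<arch>' name.

```python
import re

# Anything that is not a lowercase letter, a digit or a hyphen is
# replaced. This follows the RFC 1123 label rules (an assumption about
# what "legal" means here; the commit does not say which rule was broken).
_INVALID_CHARS = re.compile(r'[^a-z0-9-]+')


def to_hostname_label(name, max_length=63):
    """Turn an arbitrary instance name into a legal hostname label.

    Hypothetical helper for illustration, not part of uci-engine.
    """
    label = _INVALID_CHARS.sub('-', name.lower())
    # Truncate first, then strip, so truncation cannot leave a
    # trailing hyphen behind.
    label = label[:max_length].strip('-')
    if not label:
        raise ValueError('no legal hostname can be derived from %r' % name)
    return label
```

Keeping the nova name informative and passing only the sanitized form as the hostname would match the second FIXME added in run_worker.py.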

915. By Vincent Ladeuil

The test runner check_worker.py in-production test can fail during high
load.

This is ironic as basically it means: if it fails, it's broken. And
this is a bit hard to disambiguate from: it fails when something
*else*[1] is broken.

I.e. it's a false positive and creates noise.

Alternatively, a higher level controller could have another way to
measure the load and knowing that the load is high, swallow that failure
by cleaning the queue silently.

[1]: Optionally but more importantly O_o

916. By Vincent Ladeuil

Document the trick used to create a vivid image from a utopic one.

917. By Vincent Ladeuil

Hack process_results to work around issues with the testbed not being properly set up (including a wrong host name).

918. By Vincent Ladeuil

Merge britney resolving conflicts

919. By Vincent Ladeuil

Add autodep8 as a dependency for the test runner for vivid packages that rely on it.

920. By Vincent Ladeuil

Decode file content or we can't encode them back.

921. By Vincent Ladeuil

Hack swift retries around container deletion.
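
The retry loop added in process_results.py (delete_tr_data_store in the diff) can be generalized roughly as follows. This is a sketch under stated assumptions, not the code in the branch: retry, its parameters and the injectable sleep function are hypothetical, and unlike the branch version it re-raises the last failure instead of only logging it.

```python
import logging
import time


def retry(operation, delays=(5.0, 10.0, 60.0, 120.0),
          exceptions=(Exception,), sleep=time.sleep):
    """Call operation until it succeeds, sleeping between attempts.

    Makes len(delays) + 1 attempts in total. The sleep function is a
    parameter so tests do not have to wait for real.
    """
    for delay in delays:
        try:
            return operation()
        except exceptions:
            logging.exception('Operation failed, retrying in %ss', delay)
            sleep(delay)
    # Last attempt: no sleep afterwards, and any exception propagates
    # to the caller instead of being swallowed.
    return operation()
```

With such a helper the deletion would read something like retry(lambda: store.delete(recursive=True), exceptions=(data_store.DataStoreException,)), where store stands for the test runner data store.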

922. By Vincent Ladeuil

Add FIXME about cloud-init erasing apt sources.list from adt-setup-vm and the currently tried workaround.

923. By Vincent Ladeuil

We do have a public IP for rabbit now.

924. By Vincent Ladeuil

Make it easier to reuse the test for the britney use case.

925. By Vincent Ladeuil

Merge trunk, resolving conflicts

926. By Vincent Ladeuil

Merge trunk

927. By Vincent Ladeuil

Disable the test in-production until it supports running on a loaded engine.

928. By Vincent Ladeuil

Whether we have a public IP or not, we still need to expose rabbit.

Despite the fact that in the uci-britney deployment a specific security group has been defined to give access to snakefruit with the same rules that juju creates when exposing a service, the juju group is still required.

929. By Vincent Ladeuil

One bug leading to OOM.
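
The FIXME added in amqp_worker.py describes the bug: a log handler is added for each request and never removed, so a long-running worker accumulates handlers until it runs out of memory. One possible fix, sketched here and not taken from the branch, is to scope the handler to the request with a context manager. request_log is a hypothetical name; the format string is the one visible in the diff.

```python
import contextlib
import io
import logging


@contextlib.contextmanager
def request_log(logger, level=logging.INFO):
    """Attach a per-request handler and always detach it afterwards."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        '[%(asctime)s] %(name)s:%(levelname)s:%(message)s'))
    handler.setLevel(level)
    logger.addHandler(handler)
    try:
        yield stream
    finally:
        # Without this, every request leaves one more handler (and its
        # buffer) attached to the logger for the life of the worker.
        logger.removeHandler(handler)
        handler.close()
```

The captured text stays available through the yielded stream after the block exits, so it can still be stored with the request results.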

930. By Vincent Ladeuil

merge trunk


Preview Diff

=== modified file 'britney_proxy/britney/process_requests.py'
--- britney_proxy/britney/process_requests.py 2014-11-04 09:05:06 +0000
+++ britney_proxy/britney/process_requests.py 2014-11-17 13:01:51 +0000
@@ -64,7 +64,7 @@
         # debug purposes).
         output_queue = params.get('output_queue', self.output_queue)
         report_queue = params.get('report_queue', None)
-        params = dict(ticket_id=str(uuid.uuid4()),
+        params = dict(ticket_id='britney-{}'.format(str(uuid.uuid4())),
                       progress_trigger=output_queue,
                       series=series,
                       architecture=architecture,

=== modified file 'britney_proxy/britney/process_results.py'
--- britney_proxy/britney/process_results.py 2014-10-30 17:30:49 +0000
+++ britney_proxy/britney/process_results.py 2014-11-17 13:01:51 +0000
@@ -21,6 +21,7 @@
 import logging
 import os
 import sys
+import time
 
 
 from britney import queues
@@ -97,7 +98,19 @@
         # logging an exception should be enough. -- vila 2014-10-02
         value = self.tr_data_store.get_file(name)
         target = self.mapper.get_artifact_name(name)
-        self.britney_data_store.put_file(target, value, 'text/plain')
+        return self.britney_data_store.put_file(target, value, 'text/plain')
+
+    def delete_tr_data_store(self):
+        # Last sleep is 0.0 so we don't wait after the last attempt
+        for sleep in (5.0, 10.0, 60.0, 120.0, 0.0):
+            try:
+                self.tr_data_store.delete(recursive=True)
+                # We're done
+                return
+            except data_store.DataStoreException:
+                logging.exception('Cannot delete, wait {}s'.format(sleep))
+                # Try again
+                time.sleep(sleep)
 
     def handle_result(self, params):
         # Copy the produced results from the test request swift container into
@@ -116,17 +129,21 @@
             # uci-engine testbed hostnames are meaningless, so we just use a
             # generic name here.
             'uci-testbed')
+        # FIXME: This can fail if the testbed setup fails as the files are not
+        # produced in this case. But that shouldn't matter if we just copy all
+        # the files produced by adt-run. -- vila 2014-10-31
         self.copy_artifact('{}.log'.format(package))
-        self.copy_artifact('{}.britney.results'.format(package))
+        results_path = '{}.britney.results'.format(package)
+        self.copy_artifact(results_path)
+        results_value = self.tr_data_store.get_file(results_path)
         # The test runner swift container has been processed and can die
-        self.tr_data_store.delete(recursive=True)
+        self.delete_tr_data_store()
         report_queue = request.get('report_queue', None)
         if report_queue is not None:
             # Publish a message to report where the results have been stored
             queue = queues.PublisherQueue(report_queue)
             msg = dict(result_at=self.mapper.dir_name,
-                       britney_result=self.tr_data_store.get_file(
-                           '{}.britney.results'.format(package)))
+                       britney_result=results_value)
             queue.publish(msg)
 
     def handle(self, params):

=== modified file 'ci-utils/ci_utils/amqp_worker.py'
--- ci-utils/ci_utils/amqp_worker.py 2014-10-29 16:43:39 +0000
+++ ci-utils/ci_utils/amqp_worker.py 2014-11-17 13:01:51 +0000
@@ -159,6 +159,8 @@
         '[%(asctime)s] %(name)s:%(levelname)s:%(message)s')
     logstream.setFormatter(formatter)
     logstream.setLevel(logging.INFO)
+    # FIXME: This is called for each request and never remove leading to
+    # OOM for long running workers -- vila 2014-11-17
     log.addHandler(logstream)
     return log
 

=== modified file 'juju-deployer/test-runner.yaml.tmpl'
--- juju-deployer/test-runner.yaml.tmpl 2014-11-04 12:59:25 +0000
+++ juju-deployer/test-runner.yaml.tmpl 2014-11-17 13:01:51 +0000
@@ -8,7 +8,7 @@
             main: ./run-python ./test_runner/tstrun/run_worker.py
             current_code: ${CI_PAYLOAD_URL}
             available_code: ${CI_PAYLOAD_URL}
-            packages: "python-requests python-novaclient python-swiftclient python-glanceclient python-uci-vms autopkgtest haveged python-subunit python-testtools python-lazr.enum python-kombu"
+            packages: "python-requests python-novaclient python-swiftclient python-glanceclient python-uci-vms autodep8 autopkgtest haveged python-subunit python-testtools python-lazr.enum python-kombu"
             unit-config: include-base64://configs/unit_config.yaml
             uid: ubuntu
             gid: ubuntu
@@ -19,7 +19,8 @@
                 - null
                 - null
             cron_cmd: ./run-python ./test_runner/bin/check_worker.py
-            cron_schedule: "0 */2 * * *"
+# FIXME: This create too much noise for now -- vila 2014-11-14
+# cron_schedule: "0 */2 * * *"
             nagios_context: ci-airline-staging
             nagios_check_health_params: -t 7800 test_runner.health
     ci-airline-rabbit:

=== modified file 'test_runner/bin/check_worker.py'
--- test_runner/bin/check_worker.py 2014-11-04 09:05:06 +0000
+++ test_runner/bin/check_worker.py 2014-11-17 13:01:51 +0000
@@ -38,8 +38,8 @@
 # FIXME: This is wrong by design: it could happen (and it did in real life)
 # that no worker is available for 7 minutes. This leads to a timeout and the
 # progress queue staying alive without anybody subscribing to it. When the
-# request is processed later, it left the expected 6 messages hanging around in
-# the queue... -- vila 2014-10-31
+# request is processed later, it leaves up to the expected 6 messages hanging
+# around in the queue... -- vila 2014-10-31
 def timeout_handler(sig_num, frame):
     raise Timeout('No worker responding after {} seconds'.format(TIMEOUT))
 
@@ -80,7 +80,7 @@
 
 
 def check_worker():
-    queue = 'testrun-test-{}'.format(uuid.uuid4())
+    queue = 'health-check-test-runner-{}'.format(uuid.uuid4())
     series = 'trusty'
     architecture = 'amd64'
     image_id = image_store.uci_image_name('cloudimg', series, architecture)
@@ -93,6 +93,7 @@
         'ppa_list': [],
         'progress_trigger': queue,
     }
+    logging.info('Sending {}'.format(params))
     amqp_utils.send(
         amqp_utils.TEST_RUNNER_QUEUE, json.dumps(params), True)
     try:

=== modified file 'test_runner/tstrun/run_test.py'
--- test_runner/tstrun/run_test.py 2014-10-17 12:28:14 +0000
+++ test_runner/tstrun/run_test.py 2014-11-17 13:01:51 +0000
@@ -87,11 +87,11 @@
     out_path = os.path.join('results', '{}-stdout'.format(name))
     if os.path.exists(out_path):
         with open(out_path) as f:
-            details['stdout'] = content.text_content(f.read())
+            details['stdout'] = content.text_content(f.read().decode('utf8'))
     err_path = os.path.join('results', '{}-stderr'.format(name))
     if os.path.exists(err_path):
         with open(err_path) as f:
-            details['stderr'] = content.text_content(f.read())
+            details['stderr'] = content.text_content(f.read().decode('utf8'))
     return details
 
 

=== modified file 'test_runner/tstrun/run_worker.py'
--- test_runner/tstrun/run_worker.py 2014-11-04 15:16:50 +0000
+++ test_runner/tstrun/run_worker.py 2014-11-17 13:01:51 +0000
@@ -154,7 +154,15 @@
         self.status_cb('Setting up the testbed for ticket {}'.format(
             self.ticket_id))
         auth_conf = unit_config.get_auth_config()
-        tb_name = '{}-testbed-{}'.format(package, self.ticket_id)
+        # We can't use arbitrary host names which rules out using the
+        # package name and the ticket id. At least mentioning series/arch
+        # provides some hint about what is running there...
+        # FIXME: Investigate whether we can set some nova attributes for
+        # debug/monitoring purposes -- vila 2014-10-31
+        # FIXME: Separate nova name from host name so we can guarantee
+        # using a proper hostname while still having an informative nova
+        # name. -- vila 2014-11-01.
+        tb_name = 'uci-testbed-{}-{}'.format(series, architecture)
         conf = testbed.vms_config_from_auth_config(tb_name, auth_conf)
         conf.set('vm.image', image_id)
         conf.set('vm.release', series)

=== modified file 'test_runner/tstrun/testbed.py'
--- test_runner/tstrun/testbed.py 2014-10-17 12:28:22 +0000
+++ test_runner/tstrun/testbed.py 2014-11-17 13:01:51 +0000
@@ -307,6 +307,9 @@
         self.ensure_ssh_works()
         ppas = self.conf.get('vm.ppas')
         if ppas:
+            # FIXME: britney set 'apt_preserve_sources_list: true' in
+            # /etc/cloud/cloud.cfg, make sure we don't break that or put the
+            # same into user-data -- vila 2014-11-07
             cmd = ['sudo', 'add-apt-repository']
             if self.conf.get('vm.release') > 'precise':
                 cmd.append('--enable-source')

=== modified file 'test_runner/tstrun/tests/test_testbed.py'
--- test_runner/tstrun/tests/test_testbed.py 2014-11-04 12:07:12 +0000
+++ test_runner/tstrun/tests/test_testbed.py 2014-11-17 13:01:51 +0000
@@ -341,7 +341,6 @@
 
     @tests.log_on_failure()
     def test_create_usable_testbed(self, logger):
-        self.conf.set('vm.release', 'trusty')
         self.conf.set('vm.image', self.get_image_id())
         tb = testbed.TestBed(self.conf, logger)
         self.addCleanup(tb.teardown)

=== modified file 'test_runner/tstrun/tests/test_worker.py'
--- test_runner/tstrun/tests/test_worker.py 2014-11-04 12:07:12 +0000
+++ test_runner/tstrun/tests/test_worker.py 2014-11-17 13:01:51 +0000
@@ -138,7 +138,8 @@
         # series
         ([('precise', dict(series='precise', result='skip')),
           ('trusty', dict(series='trusty', result='pass')),
-          ('utopic', dict(series='utopic', result='pass'))]),
+          ('utopic', dict(series='utopic', result='pass')),
+          ('vivid', dict(series='vivid', result='pass')),]),
         # architectures
         ([('amd64', dict(arch='amd64')), ('i386', dict(arch='i386'))]))
 
@@ -163,8 +164,8 @@
         # ticket, only the master ppa is relevant.
         return [self.conf.get('master_ppa')]
 
-    def get_image_id(self):
-        return image_store.uci_image_name('cloudimg', self.series, self.arch)
+    def get_image_id(self, prefix):
+        return image_store.uci_image_name(prefix, self.series, self.arch)
 
     def assertTestRun(self, logger, params):
         worker = run_worker.TestRunnerWorker(self.ds_factory)
@@ -181,8 +182,9 @@
         params = dict(ticket_id=ticket_id, progress_trigger='progress',
                       series=self.series,
                       architecture=self.arch,
-                      image_id=self.get_image_id(),
+                      image_id=self.get_image_id('cloudimg'),
                       ppa_list=self.get_ppa_list(),
+# adt_opts=['--apt-upgrade', '--apt-pocket=proposed'],
                       package_list=[package])
         returns = self.assertTestRun(logger, params)
         (retcode, results) = returns[0]
