Merge lp:~vila/uci-engine/tr-check-worker-broken-under-load into lp:uci-engine

Proposed by Vincent Ladeuil
Status: Merged
Approved by: Vincent Ladeuil
Approved revision: 873
Merged at revision: 876
Proposed branch: lp:~vila/uci-engine/tr-check-worker-broken-under-load
Merge into: lp:uci-engine
Diff against target: 18 lines (+6/-0)
1 file modified
test_runner/bin/check_worker.py (+6/-0)
To merge this branch: bzr merge lp:~vila/uci-engine/tr-check-worker-broken-under-load
Reviewer Review Type Date Requested Status
Joe Talbott (community) Approve
PS Jenkins bot (community) continuous-integration Approve
Review via email: mp+240286@code.launchpad.net

Commit message

Document the last remaining cause of leaked rabbit queues in the nagios check.

Description of the change

The in-production test for the test runner is:

                cron_cmd: ./run-python ./test_runner/bin/check_worker.py
                cron_schedule: "0 */2 * * *"
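
For reference, that schedule fires at minute 0 of every second hour. Rendered as
a plain crontab line (paths assumed to be relative to the deployed branch root,
which is not shown here), it amounts to:

    0 */2 * * * ./run-python ./test_runner/bin/check_worker.py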

But as noted in the FIXME introduced by this MP, the check can fail when the
engine is under load: no test runner worker is available to process the
request, so the timeout triggers and the test fails.

Yet the request is still in the 'test_runner' queue; when it is eventually
processed, the worker reports to the progress queue, but nobody will ever
listen to it, so the queue is leaked.
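
To make the failure mode concrete, here is a minimal sketch of that sequence.
The queue_service object and its declare/publish/consume calls are hypothetical
stand-ins for the real messaging layer used by check_worker.py; only TIMEOUT,
the Timeout exception and timeout_handler come from the actual script:

    import signal
    import uuid


    class Timeout(Exception):
        pass


    TIMEOUT = 60 * 7  # 420 seconds, as in check_worker.py


    def timeout_handler(sig_num, frame):
        raise Timeout('No worker responding after {} seconds'.format(TIMEOUT))


    def run_check(queue_service):
        # Declare a throw-away progress queue and enqueue a request asking
        # whichever worker picks it up to report progress there.
        progress_queue = 'progress-{}'.format(uuid.uuid4())
        queue_service.declare(progress_queue)
        queue_service.publish('test_runner', {'reply_to': progress_queue})

        # Arm the alarm; if no worker answers in time, Timeout is raised.
        signal.signal(signal.SIGALRM, timeout_handler)
        signal.alarm(TIMEOUT)
        try:
            queue_service.consume(progress_queue)
        finally:
            signal.alarm(0)
        # On Timeout the check exits, but the request stays in the
        # 'test_runner' queue; when a worker processes it later, it publishes
        # to progress_queue, which no longer has a consumer, so the queue and
        # its messages are leaked.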

Filing this MP so we don't lose track of the issue and to reduce the gap
between lp:uci-engine and the version deployed on uci-britney.

We've talked about revisiting the in-production tests later, so I won't dig
into this in the meantime.

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:872
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/1659/
Executed test runs:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/1659/rebuild

review: Approve (continuous-integration)
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:873
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/1664/
Executed test runs:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/1664/rebuild

review: Approve (continuous-integration)
Revision history for this message
Joe Talbott (joetalbott) wrote :

This looks fine to me.

review: Approve
Revision history for this message
Ubuntu CI Bot (uci-bot) wrote :

The attempt to merge lp:~vila/uci-engine/tr-check-worker-broken-under-load into lp:uci-engine failed. Below is the output from the failed tests.

Running cm...
Updating source dependencies...
Updating source dependencies...
Updating source dependencies...
Updating source dependencies...
Updating source dependencies...
Updating source dependencies...
uploading webui-content.tgz to swift
Updating source dependencies...
2014-11-03 19:00:28 INFO juju.cmd supercommand.go:37 running jujud [1.20.11.1-precise-amd64 gc]
2014-11-03 19:00:28 DEBUG juju.agent agent.go:377 read agent config, format "1.18"
2014-11-03 19:00:28 INFO juju.jujud unit.go:78 unit agent unit-ci-airline-ts-block-storage-broker-0 start (1.20.11.1-precise-amd64 [gc])
2014-11-03 19:00:28 INFO juju.worker runner.go:260 start "api"
2014-11-03 19:00:28 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2014-11-03 19:00:28 INFO juju.state.api apiclient.go:176 connection established to "wss://10.0.3.1:17070/"
2014-11-03 19:00:28 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2014-11-03 19:00:28 INFO juju.state.api apiclient.go:176 connection established to "wss://10.0.3.1:17070/"
2014-11-03 19:00:28 INFO juju.state.api apiclient.go:242 dialing "wss://10.0.3.1:17070/"
2014-11-03 19:00:28 INFO juju.state.api apiclient.go:176 connection established to "wss://10.0.3.1:17070/"
2014-11-03 19:00:28 INFO juju.worker runner.go:260 start "upgrader"
2014-11-03 19:00:28 INFO juju.worker runner.go:260 start "logger"
2014-11-03 19:00:28 DEBUG juju.worker.logger logger.go:35 initial log config: "<root>=DEBUG"
2014-11-03 19:00:28 INFO juju.worker runner.go:260 start "uniter"
2014-11-03 19:00:28 DEBUG juju.worker.logger logger.go:60 logger setup
2014-11-03 19:00:28 INFO juju.worker runner.go:260 start "apiaddressupdater"
2014-11-03 19:00:28 INFO juju.worker runner.go:260 start "rsyslog"
2014-11-03 19:00:28 DEBUG juju.worker.rsyslog worker.go:75 starting rsyslog worker mode 1 for "unit-ci-airline-ts-block-storage-broker-0" "tarmac-local"
2014-11-03 19:00:28 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2014-11-03 19:00:33 INFO juju-log Running install hook
2014-11-03 19:00:56 INFO install Reading package lists...
2014-11-03 19:00:57 INFO install Building dependency tree...
2014-11-03 19:00:57 INFO install Reading state information...
2014-11-03 19:00:57 INFO install The following NEW packages will be installed:
2014-11-03 19:00:57 INFO install ubuntu-cloud-keyring
2014-11-03 19:00:57 INFO install 0 upgraded, 1 newly installed, 0 to remove and 3 not upgraded.
2014-11-03 19:00:57 INFO install Need to get 5144 B of archives.
2014-11-03 19:00:57 INFO install After this operation, 34.8 kB of additional disk space will be used.
2014-11-03 19:00:57 INFO install Get:1 http://archive.ubuntu.com/ubuntu/ precise-updates/universe ubuntu-cloud-keyring all 2012.08.14~12.04.1 [5144 B]
2014-11-03 19:00:58 INFO install Fetched 5144 B in 0s (11.2 kB/s)
2014-11-03 19:01:01 INFO install Selecting previously unselected package ubuntu-cloud-keyring.
2014-11-03 19:01:01 INFO install (Reading d...

Preview Diff

=== modified file 'test_runner/bin/check_worker.py'
--- test_runner/bin/check_worker.py	2014-10-30 17:30:49 +0000
+++ test_runner/bin/check_worker.py	2014-11-03 13:18:30 +0000
@@ -31,9 +31,15 @@
     pass
 
 
+# 7 minutes should be enough to process libpng on trusty
 TIMEOUT = 60 * 7  # 420 seconds
 
 
+# FIXME: This is wrong by design: it could happen (and it did in real life)
+# that no worker is available for 7 minutes. This leads to a timeout and the
+# progress queue staying alive without anybody subscribing to it. When the
+# request is processed later, it leaves the expected 6 messages hanging
+# around in the queue... -- vila 2014-10-31
 def timeout_handler(sig_num, frame):
     raise Timeout('No worker responding after {} seconds'.format(TIMEOUT))
 
