Merge lp:~vila/ubuntu-ci-services-itself/1289273-racy-config-file into lp:ubuntu-ci-services-itself

Proposed by Vincent Ladeuil
Status: Merged
Approved by: Vincent Ladeuil
Approved revision: 341
Merged at revision: 341
Proposed branch: lp:~vila/ubuntu-ci-services-itself/1289273-racy-config-file
Merge into: lp:ubuntu-ci-services-itself
Diff against target: 108 lines (+50/-6)
3 files modified
TRICKS (+34/-0)
charms/precise/rabbitmq-worker/hooks/hooks.py (+12/-5)
test_runner/tstrun/__init__.py (+4/-1)
To merge this branch: bzr merge lp:~vila/ubuntu-ci-services-itself/1289273-racy-config-file
Reviewer Review Type Date Requested Status
Evan (community) Approve
PS Jenkins bot (community) continuous-integration Approve
Andy Doan Pending
Review via email: mp+209940@code.launchpad.net

Commit message

Avoid races for rabbit mq config file by renaming the file once it's complete.

Description of the change

This fixes the race encountered around the rabbit worker config file.

I suspect the same kind of race occurred for the service file so I applied the same fix (Andy, what do you think ?).

I've been unable to setup a deployment for the last hours so I'm proposing anyway, if you have a working deployment...

To post a comment you must log in.
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

FAILED: Continuous integration, rev:338
No commit message was specified in the merge proposal. Click on the following link and set the commit message (if you want a jenkins rebuild you need to trigger it yourself):
https://code.launchpad.net/~vila/ubuntu-ci-services-itself/1289273-racy-config-file/+merge/209940/+edit-commit-message

http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/327/
Executed test runs:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/327/rebuild

review: Needs Fixing (continuous-integration)
Revision history for this message
Andy Doan (doanac) wrote :

It seems sane. Lets try and deploy it once though. The removal of the f.sync calls makes me nervous. I added that originally because I saw a bug. Its not the first time I've seen python code not seem to *really* close/flush a file to disk in time for an immediate call to subprocess that requires the file:

 http://git.linaro.org/lava/lava-dispatcher.git/blob/HEAD:/lava_dispatcher/device/sdmux.py#l53

Anyway - i think its sane, but I'd like to see it deploy for us

339. By Vincent Ladeuil

Keep the temp file ! os.fsync() restored, witnessed a 'No such file or directory' on the following os.rename() in 1 of the three deployed workers...

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:339
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/332/
Executed test runs:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/332/rebuild

review: Approve (continuous-integration)
340. By Vincent Ladeuil

Re-remove os.fsync() as previous tests were broken.

Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

PASSED: Continuous integration, rev:340
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/336/
Executed test runs:

Click here to trigger a rebuild:
http://s-jenkins.ubuntu-ci:8080/job/uci-engine-ci/336/rebuild

review: Approve (continuous-integration)
341. By Vincent Ladeuil

juju status accepts regexps \o/

Revision history for this message
Vincent Ladeuil (vila) wrote :

I've now completed 2 successful full deployments without any errors.

On top of that, I've done 3 terminate-machine --force (all workers) / deploy.py (same code, so safe) runs without errors.

@Andy: I'm now convinced the patch does the right thing. The failures I saw on Friday were a different kind of race introduced by stupidly not using delete=False.

What happened with delete=True (the default) is that it succeeded "often" because the rename succeeded before the os really removed the tmp file.

Revision history for this message
Vincent Ladeuil (vila) wrote :

@Andy: I still keep my eyes opened about the service file.

Revision history for this message
Evan (ev) wrote :

82 + os.rename(f.name, path)

os.unlink(f.name)?

Otherwise looks good. +1

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'TRICKS'
2--- TRICKS 2014-02-28 19:22:57 +0000
3+++ TRICKS 2014-03-08 10:02:17 +0000
4@@ -225,3 +225,37 @@
5 out, which will cause Juju to fail mid-deployment.
6
7 $ for x in $(nova secgroup-list | tail -n +5 |awk '{ print $2 }' | grep -v '^$'); do nova secgroup-delete $x; done
8+
9+Making juju status useful again
10+===============================
11+
12+Our deployment is now too big to use 'watch -n 5 juju status', but 'juju status' is smart:
13+
14+$ juju status bsb-worker
15+environment: hp
16+machines:
17+ "21":
18+ agent-state: started
19+ agent-version: 1.17.4
20+ dns-name: 15.125.100.252
21+ instance-id: 305ddc5b-aa77-4d1f-9087-780f8335f6ba
22+ instance-state: ACTIVE
23+ series: precise
24+ hardware: arch=amd64 cpu-cores=1 mem=1024M root-disk=10240M
25+services:
26+ bsb-worker:
27+ charm: local:precise/rabbitmq-worker-0
28+ exposed: false
29+ relations:
30+ amqp:
31+ - rabbit
32+ units:
33+ bsb-worker/3:
34+ agent-state: started
35+ agent-version: 1.17.4
36+ machine: "21"
37+ public-address: 15.125.100.252
38+
39+So 'watch -n 5 juju status <regexp>' can be use to monitor a subset of the
40+deployment and is useful learning tool about how machines, units and
41+services comes up.
42
43=== modified file 'charms/precise/rabbitmq-worker/hooks/hooks.py'
44--- charms/precise/rabbitmq-worker/hooks/hooks.py 2014-03-04 16:12:49 +0000
45+++ charms/precise/rabbitmq-worker/hooks/hooks.py 2014-03-08 10:02:17 +0000
46@@ -6,6 +6,7 @@
47 import shutil
48 import subprocess
49 import sys
50+import tempfile
51 import textwrap
52 import charmhelpers.fetch
53
54@@ -137,20 +138,26 @@
55 'uid': config.get('uid', 'nobody'),
56 'gid': config.get('gid', 'nogroup'),
57 }
58- with open('/etc/init/%s.conf' % _service_name(config), 'w') as f:
59+ path = '/etc/init/{}.conf'.format(_service_name(config))
60+ with tempfile.NamedTemporaryFile('w', delete=False) as f:
61 f.write(template.format(**params))
62- # need this so the file is ready for the subprocess call
63- f.flush()
64- os.fsync(f.fileno())
65+ # Rename to final destination
66+ os.rename(f.name, path)
67
68
69 def _create_amqp_config(config):
70- with open(os.path.join(_service_dir(config), 'amqp_config.py'), 'w') as f:
71+ path = os.path.join(_service_dir(config), 'amqp_config.py')
72+ with tempfile.NamedTemporaryFile('w', delete=False) as f:
73 f.write('# DO NOT EDIT. Generated by restish charm hook\n')
74 f.write('AMQP_USER = "%s"\n' % config['amqp-user'])
75 f.write('AMQP_VHOST = "%s"\n' % config['amqp-vhost'])
76 f.write('AMQP_HOST = "%s"\n' % _relation_get('private-address'))
77 f.write('AMQP_PASSWORD = "%s"\n' % _relation_get('password'))
78+ # Now that the file is complete (and closed) rename it to its final
79+ # destination. This avoids races where another process try to import that
80+ # file as a python module (creating a .pyc) and ends up having close enough
81+ # time stamps to fool python (which won't reload the completed file).
82+ os.rename(f.name, path)
83
84
85 def amqp_relation_joined(config):
86
87=== modified file 'test_runner/tstrun/__init__.py'
88--- test_runner/tstrun/__init__.py 2014-02-28 09:25:47 +0000
89+++ test_runner/tstrun/__init__.py 2014-03-08 10:02:17 +0000
90@@ -16,6 +16,9 @@
91 import os
92 import yaml
93
94+
95+HERE = os.path.abspath(os.path.dirname(__file__))
96+
97 # FIXME: Everything below sounds like config material, may be they should just
98 # be moved into their own config.py file. -- vila 2014-02-05
99
100@@ -27,7 +30,7 @@
101 # data store require nova credentials.
102 def get_auth_config(path=None):
103 if path is None:
104- path = os.path.join(os.path.dirname(__file__), '../../unit_config')
105+ path = os.path.abspath(os.path.join(HERE, '../../unit_config'))
106 with open(path) as f:
107 config = yaml.safe_load(f)
108 return config

Subscribers

People subscribed via source and target branches