Merge lp:~stub/charms/trusty/cassandra/blamejuju into lp:charms/trusty/cassandra

Proposed by Stuart Bishop
Status: Rejected
Rejected by: Stuart Bishop
Proposed branch: lp:~stub/charms/trusty/cassandra/blamejuju
Merge into: lp:charms/trusty/cassandra
Diff against target: 339 lines (+157/-68)
1 file modified
testing/amuletfixture.py (+157/-68)
To merge this branch: bzr merge lp:~stub/charms/trusty/cassandra/blamejuju
Reviewer Review Type Date Requested Status
Adam Israel (community) Approve
Review Queue (community) automated testing Needs Fixing
Review via email: mp+283311@code.launchpad.net

Description of the change

A provisioning failure rate of just 1% means a complex test suite that provisions 50 units will fail close to half the time (1 - 0.99^50 ≈ 40%). This makes charms with comprehensive tests look bad when compared to charms with minimal tests that are less likely to hit provisioning issues.

This branch attempts to make things fairer on the charm by skipping tests when a timeout has occurred and we detect that provisioning has failed. False negatives, where tests fail for reasons beyond our control, cause major delays in landing bug fixes and improvements. Instead, I'm choosing to accept potential false positives, where a test that should fail is skipped because the cloud happened to be misbehaving. Care will need to be taken when landing, to ensure a run has completed without skipped tests in the CI system. This should keep the tests green except when the charm really is at fault, and provide a fairer guide to charm quality (although a less useful one for judging cloud quality).
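
The core of the approach, condensed from the amuletfixture.py changes in the diff below (environment reset, the per-unit and leadership checks, and the error handling around 'juju status' are trimmed for brevity):

import json
import os
import subprocess
import unittest

import amulet
import amulet.helpers


class AmuletFixture(amulet.Deployment):
    """Condensed sketch: skip, rather than fail, on provisioning problems."""

    def get_status(self):
        raw = subprocess.check_output(
            ['juju', 'status', '--format=json'], universal_newlines=True)
        return json.loads(raw)

    def blame(self, headline, msg):
        # Skip the test instead of failing it when the cloud is at fault.
        # Override this if you want a fatal exception instead.
        raise unittest.SkipTest('{} {}'.format(headline.upper(), msg))

    def _blame_timeout(self):
        # A timeout happened. Inspect 'juju status' and allocate blame.
        # Does nothing if blame cannot be pinned on provisioning.
        status = self.get_status()
        for machine, info in status.get('machines', {}).items():
            state = info.get('agent-state')
            if state != 'started':
                self.blame('Provisioning Failure',
                           'Machine {} in {} state'.format(machine, state))

    def deploy(self, timeout=None):
        # Environment reset and the unit-level checks in the real fixture
        # are omitted here.
        if timeout is None:
            timeout = int(os.environ.get('AMULET_TIMEOUT', 900))
        try:
            self.setup(timeout=timeout)
        except amulet.helpers.TimeoutError:
            self._blame_timeout()  # May raise SkipTest.
            raise                  # Otherwise the failure stands.
        self.wait(timeout=timeout)

    def wait(self, timeout=None):
        if timeout is None:
            timeout = int(os.environ.get('AMULET_TIMEOUT', 900))
        cmd = ['timeout', str(timeout), 'juju', 'wait', '-q']
        try:
            subprocess.check_output(cmd, universal_newlines=True)
        except subprocess.CalledProcessError as x:
            if x.returncode == 124:    # GNU timeout expired.
                self._blame_timeout()  # May raise SkipTest.
            print(x.output)
            raise

A test suite that would rather fail outright can override blame() to raise a fatal exception instead.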

This test fixture is shared with the PostgreSQL charm and hopefully will one day become a layer to help develop comprehensive test suites.
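
For context, a charm test suite drives the fixture roughly as follows. This is an illustrative sketch rather than the charm's actual test code: the class name, unit count and import arrangement are assumptions, while AmuletFixture's setUp()/deploy()/tearDown() and amulet.Deployment's add() are as in the diff below.

import unittest

# Assumes the charm's testing/ directory is importable from the test runner.
from testing.amuletfixture import AmuletFixture


class TestCassandraDeployment(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        cls.deployment = AmuletFixture(series='trusty')
        cls.deployment.setUp()                    # repackage charm, reset env
        cls.deployment.add('cassandra', units=3)  # illustrative topology
        # A provisioning failure inside deploy() now raises
        # unittest.SkipTest instead of failing the whole run.
        cls.deployment.deploy()

    @classmethod
    def tearDownClass(cls):
        cls.deployment.tearDown()

    def test_units_are_up(self):
        # Sentries are provided by amulet once deployment succeeds.
        self.assertEqual(len(self.deployment.sentry['cassandra']), 3)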

The output can still be rather noisy when provisioning issues hit during deployment, as amulet installs a noisy signal handler that I can't see how to override. I think tidying this up will require work in amulet first.

Revision history for this message
Review Queue (review-queue) wrote :

This item has failed automated testing! Results available here http://juju-ci.vapour.ws:8080/job/charm-bundle-test-lxc/2276/

review: Needs Fixing (automated testing)
Revision history for this message
Review Queue (review-queue) wrote :

This item has failed automated testing! Results available here http://juju-ci.vapour.ws:8080/job/charm-bundle-test-aws/2255/

review: Needs Fixing (automated testing)
Revision history for this message
Review Queue (review-queue) wrote :

This item has failed automated testing! Results available here http://juju-ci.vapour.ws:8080/job/charm-bundle-test-lxc/2301/

review: Needs Fixing (automated testing)
Revision history for this message
Review Queue (review-queue) wrote :

This item has failed automated testing! Results available here http://juju-ci.vapour.ws:8080/job/charm-bundle-test-aws/2280/

review: Needs Fixing (automated testing)
Revision history for this message
Adam Israel (aisrael) wrote :

Hey stub,

I've had a chance to review these test changes today. The changes look fine to me, and the tests succeed against AWS.

Two pieces of feedback:
- The tests take a really long time. Like, 10 hours long. Is that consistent with your testing, and if so, do you expect this to always be the case?
- The tests as they are, specifically amuletfixture.py but likely other spots as well, aren't juju 2.0 compatible. We're still running 1.25 in CI, but that will be upgraded to 2.0 upon release.

This has my +1, and I'll merge it once I've finished reviewing your other pending Cassandra changes.

review: Approve

Unmerged revisions

371. By Stuart Bishop

Skip tests and blame others on provisioning failures, rather than victimize the charm

370. By Stuart Bishop

Sync updates to amuletfixture.py from the PostgreSQL charm

Preview Diff

1=== modified file 'testing/amuletfixture.py'
2--- testing/amuletfixture.py 2015-09-25 07:33:01 +0000
3+++ testing/amuletfixture.py 2016-01-20 15:19:10 +0000
4@@ -14,43 +14,45 @@
5 # You should have received a copy of the GNU General Public License
6 # along with this program. If not, see <http://www.gnu.org/licenses/>.
7
8-from functools import wraps
9 import json
10 import os
11 import shutil
12 import subprocess
13 import tempfile
14 import time
15+import unittest
16
17 import amulet
18+import amulet.helpers
19+import amulet.sentry
20 import yaml
21
22
23 class AmuletFixture(amulet.Deployment):
24- def __init__(self, series, verbose=False):
25- if verbose:
26- super(AmuletFixture, self).__init__(series=series)
27- else:
28- # We use a wrapper around juju-deployer so we can fix how it is
29- # invoked. In particular, turn off all the noise so we can
30- # actually read our test output.
31- juju_deployer = os.path.abspath(os.path.join(
32- os.path.dirname(__file__), os.pardir, 'lib',
33- 'juju-deployer-wrapper.py'))
34- super(AmuletFixture, self).__init__(series=series,
35- juju_deployer=juju_deployer)
36- assert self.series == series
37+ def __init__(self, series, charm_dir=None):
38+ self.charm_dir = charm_dir # If None, reset by repackage_charm()
39+ # We use a wrapper around juju-deployer so we can adjust how it is
40+ # invoked. In particular, only make noise on failure.
41+ juju_deployer = os.path.abspath(os.path.join(
42+ os.path.dirname(__file__), os.pardir, 'lib',
43+ 'juju-deployer-wrapper.py'))
44+ super(AmuletFixture, self).__init__(series=series,
45+ juju_deployer=juju_deployer)
46
47- def setUp(self):
48+ def setUp(self, keep=None):
49 self._temp_dirs = []
50
51- self.reset_environment(force=True)
52+ if keep:
53+ self.reset_environment(keep=keep)
54+ else:
55+ self.reset_environment(force=True)
56
57 # Repackage our charm to a temporary directory, allowing us
58 # to strip our virtualenv symlinks that would otherwise cause
59 # juju to abort. We also strip the .bzr directory, working
60 # around Bug #1394078.
61- self.repackage_charm()
62+ if self.charm_dir is None:
63+ self.repackage_charm()
64
65 # Fix amulet.Deployment so it doesn't depend on environment
66 # variables or the current working directory, but rather the
67@@ -69,31 +71,35 @@
68 os.environ['JUJU_REPOSITORY'] = temp_repo
69 os.makedirs(os.path.join(temp_repo, self.series), mode=0o700)
70
71- def tearDown(self, reset_environment=True):
72+ def tearDown(self, reset_environment=True, keep=None):
73 if reset_environment:
74- self.reset_environment()
75+ self.reset_environment(keep=keep)
76 if self.org_repo is None:
77 del os.environ['JUJU_REPOSITORY']
78 else:
79 os.environ['JUJU_REPOSITORY'] = self.org_repo
80
81- def deploy(self, timeout=None):
82+ def deploy(self, timeout=None, keep=None):
83 '''Deploying or updating the configured system.
84
85 Invokes amulet.Deployer.setup with a nicer name and standard
86 timeout handling.
87 '''
88+ # First, ensure any existing environment is completely
89+ # torn down. juju-deployer seems to forget to deploy
90+ # services if there is an existing service in the environment
91+ # in the process of being destroyed.
92+ self.reset_environment(keep=keep)
93 if timeout is None:
94 timeout = int(os.environ.get('AMULET_TIMEOUT', 900))
95
96- # juju-deployer is buried under here, and has race conditions.
97- # Sleep a bit before invoking it, so its cached view of the
98- # environment matches reality.
99- time.sleep(15)
100-
101 # If setUp fails, tearDown is never called leaving the
102 # environment setup. This is useful for debugging.
103- self.setup(timeout=timeout)
104+ try:
105+ self.setup(timeout=timeout)
106+ except amulet.helpers.TimeoutError:
107+ self._blame_timeout()
108+ raise
109 self.wait(timeout=timeout)
110
111 def __del__(self):
112@@ -101,6 +107,29 @@
113 if os.path.exists(temp_dir):
114 shutil.rmtree(temp_dir, ignore_errors=True)
115
116+ def add_unit(self, service, units=1, target=None, timeout=None):
117+ # Work around Bug #1510000
118+ if not isinstance(units, int) or units < 1:
119+ raise ValueError('Only positive integers can be used for units')
120+ if target is not None and units != 1:
121+ raise ValueError(
122+ "Can't deploy more than one unit when specifying a target.")
123+ if service not in self.services:
124+ raise ValueError('Service needs to be added before you can scale')
125+
126+ self.services[service]['num_units'] = \
127+ self.services[service].get('num_units', 1) + units
128+
129+ if self.deployed:
130+ args = ['add-unit', service, '-n', str(units)]
131+ if target is not None:
132+ args.extend(["--to", target])
133+ amulet.helpers.juju(args)
134+ if timeout is None:
135+ timeout = int(os.environ.get('AMULET_TIMEOUT', 900))
136+ self.sentry = amulet.sentry.Talisman(self.services,
137+ timeout=timeout)
138+
139 def get_status(self):
140 try:
141 raw = subprocess.check_output(['juju', 'status', '--format=json'],
142@@ -114,16 +143,96 @@
143
144 def wait(self, timeout=None):
145 '''Wait until the environment has reached a stable state.'''
146+ cmd = ['juju', 'wait', '-q']
147 if timeout is None:
148 timeout = int(os.environ.get('AMULET_TIMEOUT', 900))
149- cmd = ['timeout', str(timeout), 'juju', 'wait', '-q']
150+ cmd = ['timeout', str(timeout)] + cmd
151 try:
152 subprocess.check_output(cmd, universal_newlines=True)
153 except subprocess.CalledProcessError as x:
154+ if x.returncode == 124:
155+ self._blame_timeout() # Maybe raises an exception.
156 print(x.output)
157 raise
158
159- def reset_environment(self, force=False):
160+ # Override if you want a fatal exception.
161+ blame_exception = unittest.SkipTest
162+
163+ def blame(self, headline, msg):
164+ '''
165+ If the failure is not our fault, skip the test. This will
166+ lead to false positive test runs, but seems better than
167+ bogging down reviews because a cloud 1% failure rate
168+ translates to a 50% test suite failure rate. Once we stop
169+ expecting juju and providers to fail, we can stop this
170+ and go back to reasonable behaviour.
171+
172+ Override if you want a fatal exception.
173+ '''
174+ raise unittest.SkipTest('{} {}'.format(headline.upper(), msg))
175+
176+ def _blame_timeout(self):
177+ '''A timeout happened. Investigate and allocate blame.
178+
179+        Raises a self.blame_exception if the failure is determined
180+        to be a Juju or provider fault.
181+
182+ Does nothing if blame cannot be made.
183+ '''
184+ status = self.get_status()
185+
186+ provisioning = "Provisioning Failure"
187+ if len(status.get('machines', {})) <= 1:
188+ self.blame(provisioning,
189+ 'No machines provisioned')
190+ for machine, info in status.get('machines', {}).items():
191+ agent_state = info.get('agent-state')
192+ if agent_state != 'started':
193+ self.blame(provisioning,
194+ 'Machine {} in {} state'.format(machine,
195+ agent_state))
196+ if not info.get('dns-name'):
197+ self.blame(provisioning,
198+ 'Machine {} has no IP address'.format(machine))
199+
200+ if not info.get('instance-id'):
201+ self.blame(provisioning,
202+ 'Machine {} has no instance-id'.format(machine))
203+
204+ if not info.get('agent-version'):
205+ self.blame(provisioning,
206+ 'Machine {} has no juju agent'.format(machine))
207+
208+ # TODO: We don't check subordinates yet. Do these ever fail if
209+ # the primary is provisioned fine?
210+ for service, service_info in status.get('services', {}).items():
211+ leader_found = False
212+ for unit, unit_info in service_info.get('units', {}).items():
213+ if not unit_info.get('machine'):
214+ self.blame(provisioning,
215+ 'Unit {} has not been allocated a machine'
216+ ''.format(unit))
217+ if not unit_info.get('public-address'):
218+ self.blame(provisioning,
219+ 'Unit {} has no IP address'.format(unit))
220+ if not unit_info.get('agent-version'):
221+ self.blame(provisioning,
222+ 'Unit {} has no agent'.format(unit))
223+ agent_state = unit_info.get('agent-state')
224+ if agent_state != 'started':
225+ self.blame(provisioning,
226+ 'Unit {} has agent state {}'
227+ ''.format(agent_state))
228+                               ''.format(unit, agent_state))
229+ if amulet.helpers.juju(cmd).strip() == 'True':
230+ leader_found = True
231+ if not leader_found:
232+ self.blame("Juju Leadership Failure",
233+ "Service {} has no leader".format(service))
234+
235+ def reset_environment(self, force=False, keep=None):
236+ if keep is None:
237+ keep = frozenset()
238 if force:
239 status = self.get_status()
240 machines = [m for m in status.get('machines', {}).keys()
241@@ -133,50 +242,47 @@
242 '--force'] + machines,
243 stdout=subprocess.DEVNULL,
244 stderr=subprocess.DEVNULL)
245+
246 fails = dict()
247+ keep_machines = set(["0"])
248 while True:
249 status = self.get_status()
250 service_items = status.get('services', {}).items()
251- if not service_items:
252- break
253+
254 for service_name, service in service_items:
255+ if service_name in keep:
256+ # Don't mess with this service.
257+ keep_machines.update([unit['machine'] for unit
258+ in service['units'].values()])
259+ continue
260+
261 if service.get('life', '') not in ('dying', 'dead'):
262 subprocess.call(['juju', 'destroy-service', service_name],
263 stdout=subprocess.PIPE,
264 stderr=subprocess.STDOUT)
265+
266 for unit_name, unit in service.get('units', {}).items():
267 if unit.get('agent-state', None) == 'error':
268- if force:
269- # If any units have failed hooks, unstick them.
270- # This should no longer happen now we are
271- # using the 'destroy-machine --force' command
272- # earlier.
273- try:
274- subprocess.check_output(
275- ['juju', 'resolved', unit_name],
276- stderr=subprocess.STDOUT)
277- except subprocess.CalledProcessError:
278- # A previous 'resolved' call make cause a
279- # subsequent one to fail if it is still
280- # being processed. However, we need to keep
281- # retrying because after a successful
282- # resolution a subsequent hook may cause an
283- # error state.
284- pass
285- else:
286- fails[unit_name] = unit
287+ fails[unit_name] = unit
288+
289+ services = set(k for k, v in service_items
290+ if k not in keep)
291+ if not services:
292+ break
293+
294 time.sleep(1)
295
296 harvest_machines = []
297 for machine, state in status.get('machines', {}).items():
298- if machine != "0" and state.get('life') not in ('dying', 'dead'):
299+ if machine not in keep_machines and (state.get('life')
300+ not in ('dying', 'dead')):
301 harvest_machines.append(machine)
302
303 if harvest_machines:
304 cmd = ['juju', 'remove-machine', '--force'] + harvest_machines
305 subprocess.check_output(cmd, stderr=subprocess.STDOUT)
306
307- if fails:
308+ if fails and not force:
309 raise Exception("Teardown failed", fails)
310
311 def repackage_charm(self):
312@@ -206,10 +312,6 @@
313
314 repack_root = tempfile.mkdtemp(suffix='.charm')
315 self._temp_dirs.append(repack_root)
316- # juju-deployer now requires the series in the path when
317- # deploying from an absolute path.
318- repack_root = os.path.join(repack_root, self.series)
319- os.makedirs(repack_root, mode=0o700)
320
321 self.charm_dir = os.path.join(repack_root, self.charm_name)
322
323@@ -219,16 +321,3 @@
324 # infinite recursion.
325 shutil.copytree(src_charm_dir, self.charm_dir, symlinks=True,
326 ignore=shutil.ignore_patterns('.venv?', '.bzr'))
327-
328-
329-# Bug #1417097 means we need to monkey patch Amulet for now.
330-real_juju = amulet.helpers.juju
331-
332-
333-@wraps(real_juju)
334-def patched_juju(args, env=None):
335- args = [str(a) for a in args]
336- return real_juju(args, env)
337-
338-amulet.helpers.juju = patched_juju
339-amulet.deployer.juju = patched_juju
