Merge ~rbalint/autopkgtest-cloud:more-retries into autopkgtest-cloud:master
Status: Merged
Merge reported by: Balint Reczey
Merged at revision: bc4d6865bf81ffd8f5958b5dabc503fef9289f94
Proposed branch: ~rbalint/autopkgtest-cloud:more-retries
Merge into: autopkgtest-cloud:master
Diff against target: 156 lines (+65/-53), 1 file modified (worker/worker)
Related bugs:
Reviewer: Iain Lane (Approve)
Review via email: mp+384609@code.launchpad.net
Commit message
Description of the change
Balint Reczey (rbalint) wrote:
@laney, thanks, I've updated the name-resolution-related retry and dropped the clock-skew one because it is not very frequent.
I suspect systemd-timesyncd; I've left a comment in LP: #1880839.
Iain Lane (laney) wrote:
This still needs fixing to run the code in the case of exit codes 2, 4, 6 and 8 - see the second paragraph above.
See
https:/
This is the block that needs updating.
Balint Reczey (rbalint) wrote:
Thanks, indeed.
I think the retry should not run for 2 and 8, and I'm not sure about 12, 14 and 20, so I've added 4 and 6 as safe bets.
Iain Lane (laney) wrote:
Sorry Balint, this still needs fixing. :(
I'm not being clear, let me try some more:
There are two types of failure.
1. Permanent failures, which cause a test to be marked as a fail in the database and frontends (e.g. proposed-migration or the autopkgtest web frontend).
2. Temporary failures, where autopkgtest thinks that *it* caused the failure, or it's otherwise transient. These are reported by exit code 16 and autopkgtest-cloud queues these to be re-run.
2.5. Sometimes permanent failures are misdetected as temporary ones, so we have this code to convert between the two. In that case we override exit code 16 to code 4 ("at least one test failed").
What you want to introduce is a 1.5 that's the kind of opposite of 2.5. We want to convert some kinds of permanent failure into temporary ones, so that they get retried. We do *not* want to generally start retrying all permanent failures, which is what the MP currently would do.
I think that *above* the "if code == 16 ..." line, you should add a check for "if code in (2, 4, 6, 8):", which:
- Greps the log for one of the *new* (different variable) strings that we want to retry on
- If found, return as if it were a temporary failure, so that we retry. Do this up to three times.
Of course that code will be *common* with 2.5, so you can probably move some of that into functions and call it in both places.
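The restructuring described above can be sketched as follows. This is an illustrative sketch only, not the code that was eventually merged: the helper names (`looks_temporary`, `run_with_retries`) and the retry-string list are assumptions made for the example.

```python
# Sketch of the suggested restructuring (all names are illustrative,
# not the merged code): before the existing "code == 16" branch,
# treat a permanent failure (exit codes 2, 4, 6, 8) as temporary
# when the log contains one of a new set of retry-worthy strings,
# and re-run the test up to three times in total.

TEMPORARY_TEST_FAIL_STRINGS = [
    'Temporary failure resolving',
    'Temporary failure in name resolution',
]


def looks_temporary(log_text):
    """Return the retry-worthy strings found in the log, if any."""
    return [s for s in TEMPORARY_TEST_FAIL_STRINGS if s in log_text]


def run_with_retries(run_test, read_log, max_attempts=3):
    """Re-run a test when a 'permanent' failure looks transient."""
    for _ in range(max_attempts):
        code = run_test()
        if code in (2, 4, 6, 8) and looks_temporary(read_log()):
            continue  # misdetected permanent failure: retry
        break  # pass, real failure, or a genuine tmpfail (16)
    return code
```

The key design point in the review is the direction of the conversion: this check only promotes a permanent result to a retry when a known transient string is present, rather than retrying all permanent failures.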
- adb8dc4... by Balint Reczey
  worker: Factor out reading the log file
- 7a27868... by Balint Reczey
  worker: Retry tests on possibly temporary issues
  LP: #1829752
- 77cf6b3... by Balint Reczey
  worker: Don't log that retry will take place when it won't
Balint Reczey (rbalint) wrote:
@laney Could you please take a look again?
- bc4d686... by Balint Reczey
  Workaround clock skew on testbeds
  LP: #1880839
Iain Lane (laney) wrote:
Alright, I think this looks good, let's try it. Thanks for sticking with it.
Balint Reczey (rbalint) wrote:
@laney Could you please merge it? I can't.
Balint Reczey (rbalint) wrote:
It is actually merged, sorry.
Preview Diff
```diff
diff --git a/worker/worker b/worker/worker
index c83be3c..0a77409 100755
--- a/worker/worker
+++ b/worker/worker
@@ -54,6 +54,12 @@ FAIL_STRINGS = ['Kernel panic - not syncing:',
                 'Out of memory: Kill process',
                 'error: you need to load the kernel first.',
                 'Faulting instruction address:']
+TEMPORARY_TEST_FAIL_STRINGS = ['Temporary failure resolving',
+                               'Temporary failure in name resolution',
+                               ' has modification time ']  # clock skew, LP: #1880839
+
+# Some strings occur in passing tests of specific packages
+PASS_PKG_STRINGS = {'systemd*': ['Temporary failure in name resolution']}
 
 # If we repeatedly time out when installing, there's probably a problem with
 # one of the packages' maintainer scripts.
@@ -258,6 +264,26 @@ def call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, start_time
     return ret
 
 
+def log_contents(out_dir):
+    try:
+        with open(os.path.join(out_dir, 'log'),
+                  encoding='utf-8',
+                  errors='surrogateescape') as f:
+            return f.read()
+    except IOError as e:
+        logging.error('Could not read log file: %s' % str(e))
+        return ""
+
+
+def cleanup_and_sleep(out_dir):
+    '''Empty the output dir for the next run, otherwise autopkgtest complains'''
+    shutil.rmtree(out_dir)
+    os.mkdir(out_dir)
+    running_test = False
+    time.sleep(300)
+    running_test = True
+
+
 def request(msg):
     '''Callback for AMQP queue request'''
 
@@ -531,58 +557,50 @@ def request(msg):
         retry_start_time = time.time()
         logging.info('Running %s', ' '.join(argv))
         code = call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, start_time)
+        if code in (4, 6):
+            contents = log_contents(out_dir)
+            temp_fails = [s for s in (set(TEMPORARY_TEST_FAIL_STRINGS)
+                                      - set(getglob(PASS_PKG_STRINGS, pkgname, [])))
+                          if s in contents]
+            if temp_fails:
+                logging.warning('Saw %s in log, which is a sign of a temporary failure.',
+                                ' and '.join(fails))
+                logging.warning('%sLog follows:', "Retrying in 5 minutes. " if retry < 2 else "")
+                logging.error(contents)
+                if retry < 2:
+                    cleanup_and_sleep(out_dir)
+                else:
+                    break
+
         if code == 16 or code < 0:
+            contents = log_contents(out_dir)
             if exit_requested is not None:
                 logging.warning('Testbed failure and exit %i requested. Log follows:', exit_requested)
-                try:
-                    with open(os.path.join(out_dir, 'log'),
-                              encoding='utf-8',
-                              errors='surrogateescape') as f:
-                        logging.error(f.read())
-                except IOError as e:
-                    logging.error('Could not read log file: %s' % str(e))
+                logging.error(contents)
                 sys.exit(exit_requested)
-            try:
-                with open(os.path.join(out_dir, 'log'),
-                          encoding='utf-8',
-                          errors='surrogateescape') as f:
-                    contents = f.read()
-                # Get the package-specific string for triggers too, since they might have broken the run
-                trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]
-                fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]
-
-                # Or if all-proposed, just give up and accept everything
-                fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]
-
-                allowed_fail_strings = set(FAIL_STRINGS +
-                                           getglob(FAIL_PKG_STRINGS, pkgname, []) +
-                                           fail_trigs +
-                                           (fail_all_proposed if 'all-proposed' in params else []))
-
-                fails = [s for s in allowed_fail_strings if s in contents] + \
-                    [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
-                if fails:
-                    num_failures += 1
-                    logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
+            # Get the package-specific string for triggers too, since they might have broken the run
+            trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]
+            fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]
+
+            # Or if all-proposed, just give up and accept everything
+            fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]
+
+            allowed_fail_strings = set(FAIL_STRINGS +
+                                       getglob(FAIL_PKG_STRINGS, pkgname, []) +
+                                       fail_trigs +
+                                       (fail_all_proposed if 'all-proposed' in params else [])) \
+                - set(getglob(PASS_PKG_STRINGS, pkgname, []))
+
+            fails = [s for s in allowed_fail_strings if s in contents] + \
+                [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
+            if fails:
+                num_failures += 1
+                logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
                                 ' and '.join(fails), num_failures)
-            except IOError as e:
-                logging.error('Could not read log file: %s' % str(e))
-            logging.warning('Testbed failure, retrying in 5 minutes. Log follows:')
-            try:
-                with open(os.path.join(out_dir, 'log'),
-                          encoding='utf-8',
-                          errors='surrogateescape') as f:
-                    logging.error(f.read())
-            except IOError as e:
-                logging.error('Could not read log file: %s' % str(e))
-            # we need to empty the --output-dir for the next run, otherwise
-            # autopkgtest complains
+            logging.warning('Testbed failure%s. Log follows:', ", retrying in 5 minutes" if retry < 2 else "")
+            logging.error(contents)
             if retry < 2:
-                shutil.rmtree(out_dir)
-                os.mkdir(out_dir)
-                running_test = False
-                time.sleep(300)
-                running_test = True
+                cleanup_and_sleep(out_dir)
             else:
                 break
         else:
@@ -591,13 +609,7 @@ def request(msg):
             code = 4
         else:
             logging.error('Three tmpfails in a row, aborting worker. Log follows:')
-            try:
-                with open(os.path.join(out_dir, 'log'),
-                          encoding='utf-8',
-                          errors='surrogateescape') as f:
-                    logging.error(f.read())
-            except IOError as e:
-                logging.error('Could not read log file: %s' % str(e))
+            logging.error(log_contents(out_dir))
             sys.exit(99)
 
         duration = int(time.time() - retry_start_time)
```
Iain Lane (laney) wrote:
Thanks, but sorry, this isn't going to work as-is.
To understand why, find the code where FAIL_STRINGS_REGEX is used. It's exclusively where a run has *temporarily* failed (exitcode == 16). What we had here was a permanent failure (code 4). We need to check that the exitcode is in (2, 4, 6, 8) and then grep for a different set of strings - obviously factoring out the existing logic in the most sensible way to apply to both cases.
Also the 'Temporary failure' should not be restricted to a package - we can get this happening in runs of any test. Those other ones are restricted to systemd* and linux* because they are packages which genuinely can - and do - break booting of the instances completely when they have a bug. autopkgtest isn't great at catching that so we do it outside.
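The per-package restriction referred to here relies on glob-keyed string tables. The following is an illustrative reimplementation of that lookup in the spirit of the worker's `getglob()` helper; it is an assumption made for the example, not the worker's actual code.

```python
from fnmatch import fnmatch

# Illustrative sketch (an assumption, not the worker's actual code):
# per-package strings are keyed by shell-style globs such as
# 'systemd*', so a string only applies to matching source packages.
PASS_PKG_STRINGS = {'systemd*': ['Temporary failure in name resolution']}


def getglob(globmap, pkgname, default):
    """Return the value whose glob key matches pkgname, else default."""
    for glob, value in globmap.items():
        if fnmatch(pkgname, glob):
            return value
    return default
```

Under this scheme, moving a string out of a glob-keyed table and into a global list is what makes it apply to every test run rather than only to packages like systemd* or linux*.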
OOI, do you have any ideas why the clock is moving backwards? That feels like something we should fix properly rather than retrying on, if we can understand why it happens and why provisioning hasn't given us the right time (or if our idea of time got out of step with reality...).