Merge ~rbalint/autopkgtest-cloud:more-retries into autopkgtest-cloud:master

Proposed by Balint Reczey on 2020-05-27
Status: Merged
Merge reported by: Balint Reczey
Merged at revision: bc4d6865bf81ffd8f5958b5dabc503fef9289f94
Proposed branch: ~rbalint/autopkgtest-cloud:more-retries
Merge into: autopkgtest-cloud:master
Diff against target: 156 lines (+65/-53)
1 file modified
worker/worker (+65/-53)
Reviewer Review Type Date Requested Status
Iain Lane 2020-05-27 Approve on 2020-09-07
Review via email: mp+384609@code.launchpad.net
To post a comment you must log in.
Iain Lane (laney) wrote :

Thanks, but sorry, this isn't going to work as-is.

To understand why, find the code where FAIL_STRINGS_REGEX is used. It's exclusively where a run has *temporarily* failed (exitcode == 16). What we had here was a permanent failure (code 4). We need to check that the exitcode is in (2, 4, 6, 8) and then grep for a different set of strings - obviously factoring out the existing logic in the most sensible way to apply to both cases.

Also the 'Temporary failure' should not be restricted to a package - we can get this happening in runs of any test. Those other ones are restricted to systemd* and linux* because they are packages which genuinely can - and do - break booting of the instances completely when they have a bug. autopkgtest isn't great at catching that so we do it outside.

OOI, do you have any ideas why the clock is moving backwards? That feels like something we should fix properly rather than retrying on, if we can understand why it happens and why provisioning hasn't given us the right time (or if our idea of time got out of step with reality...).

review: Needs Fixing
Balint Reczey (rbalint) wrote :

@laney, thanks, I've updated the name resolution related retry and dropped the clock skew one because it is not very frequent.

My suspect is systemd-timesyncd, I've left a comment in LP: #1880839.

Iain Lane (laney) wrote :

Still needs fixing to run this code in the case of exit code 2, 4, 6, 8 - second paragraph above.

See

https://git.launchpad.net/autopkgtest-cloud/tree/worker/worker#n534

this is the block that needs updating.

Balint Reczey (rbalint) wrote :

Thanks, indeed.
I think retry should not run for 2 and 8, and I'm not sure about 12, 14 and 20, so I've added 4 and 6 as safe bets.

Iain Lane (laney) wrote :

Sorry Balint, this still needs fixing. :(

I'm not being clear, let me try some more:

There are two types of failure.

1. Permanent failures, which cause a test to be marked as a fail in the database and frontends (e.g. proposed-migration or the autopkgtest.ubuntu.com website) display this as a failure.
2. Temporary failures, where autopkgtest thinks that *it* caused the failure, or it's otherwise transient. These are reported by exit code 16 and autopkgtest-cloud queues these to be re-run.
  2.5. Sometimes permanent failures are misdetected as temporary ones, so we have this code to convert the two. In that case we override the code 16 to a code 4 ("at least one test failed").

What you want to introduce is a 1.5 that's the kind of opposite of 2.5. We want to convert some kinds of permanent failure into temporary ones, so that they get retried. We do *not* want to generally start retrying all permanent failures, which is what the MP currently would do.

I think that *above* the "if code == 16 ..." line, you should add a check for "if code in (2, 4, 6, 8):", which:

  - Greps the log for one of the *new* (different variable) strings that we want to retry on
  - If found, return as if it were a temporary failure, so that we retry. Do this up to three times.

Of course that code will be *common* with 2.5, so you can probably move some of that into functions and call it in both places.

review: Needs Fixing
adb8dc4... by Balint Reczey on 2020-08-15

worker: Factor out reading the log file

7a27868... by Balint Reczey on 2020-08-15

worker: Retry tests on possibly temporary issues

LP: #1829752

77cf6b3... by Balint Reczey on 2020-08-16

worker: Don't log that retry will take place when it won't

Balint Reczey (rbalint) wrote :

@laney Could you please take a look again?

bc4d686... by Balint Reczey on 2020-08-29

Workaround clock skew on testbeds

LP: #1880839

Iain Lane (laney) wrote :

Alright, I think this looks good, let's try it. Thanks for sticking with it.

review: Approve
Balint Reczey (rbalint) wrote :

@laney Could you please merge it? I can't.

Balint Reczey (rbalint) wrote :

It is actually merged, sorry.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
diff --git a/worker/worker b/worker/worker
index c83be3c..0a77409 100755
--- a/worker/worker
+++ b/worker/worker
@@ -54,6 +54,12 @@ FAIL_STRINGS = ['Kernel panic - not syncing:',
54 'Out of memory: Kill process',54 'Out of memory: Kill process',
55 'error: you need to load the kernel first.',55 'error: you need to load the kernel first.',
56 'Faulting instruction address:']56 'Faulting instruction address:']
57TEMPORARY_TEST_FAIL_STRINGS = ['Temporary failure resolving',
58 'Temporary failure in name resolution',
59 ' has modification time '] # clock skew, LP: #1880839
60
61# Some strings occur in passing tests of specific packages
62PASS_PKG_STRINGS = {'systemd*': ['Temporary failure in name resolution']}
5763
58# If we repeatedly time out when installing, there's probably a problem with64# If we repeatedly time out when installing, there's probably a problem with
59# one of the packages' maintainer scripts.65# one of the packages' maintainer scripts.
@@ -258,6 +264,26 @@ def call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, star
258 return ret264 return ret
259265
260266
267def log_contents(out_dir):
268 try:
269 with open(os.path.join(out_dir, 'log'),
270 encoding='utf-8',
271 errors='surrogateescape') as f:
272 return f.read()
273 except IOError as e:
274 logging.error('Could not read log file: %s' % str(e))
275 return ""
276
277
278def cleanup_and_sleep(out_dir):
279 '''Empty the output dir for the next run, otherwise autopkgtest complains'''
280 shutil.rmtree(out_dir)
281 os.mkdir(out_dir)
282 running_test = False
283 time.sleep(300)
284 running_test = True
285
286
261def request(msg):287def request(msg):
262 '''Callback for AMQP queue request'''288 '''Callback for AMQP queue request'''
263289
@@ -531,58 +557,50 @@ def request(msg):
531 retry_start_time = time.time()557 retry_start_time = time.time()
532 logging.info('Running %s', ' '.join(argv))558 logging.info('Running %s', ' '.join(argv))
533 code = call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, start_time)559 code = call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, start_time)
560 if code in (4, 6):
561 contents = log_contents(out_dir)
562 temp_fails = [s for s in (set(TEMPORARY_TEST_FAIL_STRINGS)
563 - set(getglob(PASS_PKG_STRINGS, pkgname, [])))
564 if s in contents]
565 if temp_fails:
566 logging.warning('Saw %s in log, which is a sign of a temporary failure.',
567 ' and '.join(fails))
568 logging.warning('%sLog follows:', "Retrying in 5 minutes. " if retry < 2 else "")
569 logging.error(contents)
570 if retry < 2:
571 cleanup_and_sleep(out_dir)
572 else:
573 break
574
534 if code == 16 or code < 0:575 if code == 16 or code < 0:
576 contents = log_contents(out_dir)
535 if exit_requested is not None:577 if exit_requested is not None:
536 logging.warning('Testbed failure and exit %i requested. Log follows:', exit_requested)578 logging.warning('Testbed failure and exit %i requested. Log follows:', exit_requested)
537 try:579 logging.error(contents)
538 with open(os.path.join(out_dir, 'log'),
539 encoding='utf-8',
540 errors='surrogateescape') as f:
541 logging.error(f.read())
542 except IOError as e:
543 logging.error('Could not read log file: %s' % str(e))
544 sys.exit(exit_requested)580 sys.exit(exit_requested)
545 try:581 # Get the package-specific string for triggers too, since they might have broken the run
546 with open(os.path.join(out_dir, 'log'),582 trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]
547 encoding='utf-8',583 fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]
548 errors='surrogateescape') as f:584
549 contents = f.read()585 # Or if all-proposed, just give up and accept everything
550 # Get the package-specific string for triggers too, since they might have broken the run586 fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]
551 trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]587
552 fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]588 allowed_fail_strings = set(FAIL_STRINGS +
553589 getglob(FAIL_PKG_STRINGS, pkgname, []) +
554 # Or if all-proposed, just give up and accept everything590 fail_trigs +
555 fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]591 (fail_all_proposed if 'all-proposed' in params else [])) \
556592 - set(getglob(PASS_PKG_STRINGS, pkgname, []))
557 allowed_fail_strings = set(FAIL_STRINGS +593
558 getglob(FAIL_PKG_STRINGS, pkgname, []) +594 fails = [s for s in allowed_fail_strings if s in contents] + \
559 fail_trigs +595 [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
560 (fail_all_proposed if 'all-proposed' in params else []))596 if fails:
561597 num_failures += 1
562 fails = [s for s in allowed_fail_strings if s in contents] + \598 logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
563 [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
564 if fails:
565 num_failures += 1
566 logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
567 ' and '.join(fails), num_failures)599 ' and '.join(fails), num_failures)
568 except IOError as e:600 logging.warning('Testbed failure%s. Log follows:', ", retrying in 5 minutes" if retry < 2 else "")
569 logging.error('Could not read log file: %s' % str(e))601 logging.error(contents)
570 logging.warning('Testbed failure, retrying in 5 minutes. Log follows:')
571 try:
572 with open(os.path.join(out_dir, 'log'),
573 encoding='utf-8',
574 errors='surrogateescape') as f:
575 logging.error(f.read())
576 except IOError as e:
577 logging.error('Could not read log file: %s' % str(e))
578 # we need to empty the --output-dir for the next run, otherwise
579 # autopkgtest complains
580 if retry < 2:602 if retry < 2:
581 shutil.rmtree(out_dir)603 cleanup_and_sleep(out_dir)
582 os.mkdir(out_dir)
583 running_test = False
584 time.sleep(300)
585 running_test = True
586 else:604 else:
587 break605 break
588 else:606 else:
@@ -591,13 +609,7 @@ def request(msg):
591 code = 4609 code = 4
592 else:610 else:
593 logging.error('Three tmpfails in a row, aborting worker. Log follows:')611 logging.error('Three tmpfails in a row, aborting worker. Log follows:')
594 try:612 logging.error(log_contents(out_dir))
595 with open(os.path.join(out_dir, 'log'),
596 encoding='utf-8',
597 errors='surrogateescape') as f:
598 logging.error(f.read())
599 except IOError as e:
600 logging.error('Could not read log file: %s' % str(e))
601 sys.exit(99)613 sys.exit(99)
602614
603 duration = int(time.time() - retry_start_time)615 duration = int(time.time() - retry_start_time)

Subscribers

People subscribed via source and target branches