autopkgtest-cloud

Bug #1829752: Please trigger tests again when they fail due to transient infrastructure issues	Undecided	Fix Released
Bug #1880839: Clock skew on testbeds	Undecided	New

Reviewer	Review Type	Date Requested	Status
Iain Lane		2020-05-27	Approve on 2020-09-07
Review via email: mp+384609@code.launchpad.net

Iain Lane (laney) wrote on 2020-05-27:

Thanks, but sorry, this isn't going to work as-is.

To understand why, find the code where FAIL_STRINGS_REGEX is used. It is used exclusively where a run has *temporarily* failed (exit code == 16). What we had here was a permanent failure (code 4). We need to check that the exit code is in (2, 4, 6, 8) and then grep for a different set of strings, factoring out the existing logic in the most sensible way so that it applies to both cases.
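
The distinction between the two kinds of exit code can be sketched as a small classifier. This is only an illustration, not the worker's code: the names `PERMANENT_CODES`, `TEMPORARY_CODE` and `classify_exit_code` are hypothetical, and only the numeric values (2, 4, 6, 8 and 16) come from the comment above.

```python
# Hypothetical helper; only the numeric exit codes come from the review comment.
PERMANENT_CODES = (2, 4, 6, 8)   # treated as permanent results in this review
TEMPORARY_CODE = 16              # temporary failure, the run is re-queued


def classify_exit_code(code):
    """Return 'temporary', 'permanent' or 'other' for an autopkgtest exit code."""
    if code == TEMPORARY_CODE:
        return 'temporary'
    if code in PERMANENT_CODES:
        return 'permanent'
    return 'other'


if __name__ == '__main__':
    for code in (0, 4, 16):
        print(code, classify_exit_code(code))
```

With this split, FAIL_STRINGS_REGEX is only consulted on the 'temporary' path, which is why a run that ended with code 4 never reached it.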

Also, the 'Temporary failure' string should not be restricted to a package: it can happen in runs of any test. The other strings are restricted to systemd* and linux* because bugs in those packages genuinely can, and do, break booting of the instances completely. autopkgtest isn't great at catching that, so we do it outside.

Out of interest, do you have any idea why the clock is moving backwards? That feels like something we should fix properly rather than retry on, if we can understand why it happens and why provisioning hasn't given us the right time (or whether our idea of time got out of step with reality...).

review: Needs Fixing

Balint Reczey (rbalint) wrote on 2020-05-27:

@laney, thanks. I've updated the name-resolution retry and dropped the clock skew one because it is not very frequent.

I suspect systemd-timesyncd; I've left a comment in LP: #1880839.

Iain Lane (laney) wrote on 2020-05-27:

Still needs fixing to run this code in the case of exit code 2, 4, 6, 8 (see the second paragraph of my comment above).

See https://git.launchpad.net/autopkgtest-cloud/tree/worker/worker#n534, which is the block that needs updating.

Balint Reczey (rbalint) wrote on 2020-06-11:

Thanks, indeed.
I think retry should not run for 2 and 8, and I'm not sure about 12, 14 and 20, so I've added 4 and 6 as safe bets.

Iain Lane (laney) wrote on 2020-06-24:

Sorry Balint, this still needs fixing. :(

I haven't been clear, so let me try again:

There are two types of failure.

1. Permanent failures, which cause a test to be marked as a fail in the database; frontends (e.g. proposed-migration or the autopkgtest.ubuntu.com website) display this as a failure.
2. Temporary failures, where autopkgtest thinks that *it* caused the failure, or the failure is otherwise transient. These are reported by exit code 16, and autopkgtest-cloud queues them to be re-run.
2.5. Sometimes permanent failures are misdetected as temporary ones, so we have this code to convert between the two. In that case we override the code 16 with a code 4 ("at least one test failed").

What you want to introduce is a 1.5 that is roughly the opposite of 2.5. We want to convert some kinds of permanent failure into temporary ones, so that they get retried. We do *not* want to start retrying all permanent failures in general, which is what the MP would currently do.

I think that *above* the "if code == 16 ..." line, you should add a check for "if code in (2, 4, 6, 8):", which:

- Greps the log for one of the *new* (different variable) strings that we want to retry on
- If found, return as if it were a temporary failure, so that we retry. Do this up to three times.

Of course that code will be *common* with 2.5, so you can probably move some of that into functions and call it in both places.
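
The "1.5" path described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the worker's implementation: `RETRY_STRINGS`, `should_retry`, `run_with_retries` and the `run_once` callable are hypothetical names, the single sample string is illustrative, and the five-minute sleep between attempts is omitted.

```python
# Hypothetical names throughout; only the exit codes and the
# "up to three times" limit are taken from the review comment.
PERMANENT_CODES = (2, 4, 6, 8)
TEMPORARY_CODE = 16
RETRY_STRINGS = ['Temporary failure resolving']   # the *new*, separate list
MAX_ATTEMPTS = 3


def should_retry(code, log_text, retry_strings=RETRY_STRINGS):
    """True if a permanent-looking failure shows a known transient symptom."""
    return code in PERMANENT_CODES and any(s in log_text for s in retry_strings)


def run_with_retries(run_once, max_attempts=MAX_ATTEMPTS):
    """Call run_once() until the result no longer looks transient.

    run_once is a callable returning (exit_code, log_text).
    Returns (exit_code, log_text, attempts_made) for the last attempt.
    """
    for attempt in range(1, max_attempts + 1):
        code, log_text = run_once()
        if code != TEMPORARY_CODE and not should_retry(code, log_text):
            break   # a genuine result: do not retry
    return code, log_text, attempt


if __name__ == '__main__':
    outcomes = iter([(4, 'Temporary failure resolving archive'), (0, 'all good')])
    print(run_with_retries(lambda: next(outcomes)))
```

Permanent failures whose log does not contain one of the retry strings leave the loop after a single attempt, so not every failure is retried.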

review: Needs Fixing

~rbalint/autopkgtest-cloud:more-retries updated on 2020-08-16

adb8dc4... by Balint Reczey on 2020-08-15

worker: Factor out reading the log file

7a27868... by Balint Reczey on 2020-08-15

worker: Retry tests on possibly temporary issues

LP: #1829752

77cf6b3... by Balint Reczey on 2020-08-16

worker: Don't log that retry will take place when it won't

Balint Reczey (rbalint) wrote on 2020-08-28:

@laney Could you please take a look again?

~rbalint/autopkgtest-cloud:more-retries updated on 2020-08-29

bc4d686... by Balint Reczey on 2020-08-29

Workaround clock skew on testbeds

LP: #1880839

Iain Lane (laney) wrote on 2020-09-07:

Alright, I think this looks good, let's try it. Thanks for sticking with it.

review: Approve

Balint Reczey (rbalint) wrote on 2020-09-09:

@laney Could you please merge it? I can't.

Balint Reczey (rbalint) wrote on 2020-09-09:

It is actually merged, sorry.

Preview Diff

 diff --git a/worker/worker b/worker/worker
 index c83be3c..0a77409 100755
 --- a/worker/worker
 +++ b/worker/worker
@@ -54,6 +54,12 @@ FAIL_STRINGS = ['Kernel panic - not syncing:',
                  'Out of memory: Kill process',
                  'error: you need to load the kernel first.',
                  'Faulting instruction address:']
++TEMPORARY_TEST_FAIL_STRINGS = ['Temporary failure resolving',
++                               'Temporary failure in name resolution',
++                               ' has modification time ']  # clock skew, LP: #1880839
++
++# Some strings occur in passing tests of specific packages
++PASS_PKG_STRINGS = {'systemd*': ['Temporary failure in name resolution']}
  # If we repeatedly time out when installing, there's probably a problem with
  # one of the packages' maintainer scripts.
@@ -258,6 +264,26 @@ def call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, star
      return ret
++def log_contents(out_dir):
++    try:
++        with open(os.path.join(out_dir, 'log'),
++                  encoding='utf-8',
++                  errors='surrogateescape') as f:
++            return f.read()
++    except IOError as e:
++        logging.error('Could not read log file: %s' % str(e))
++        return ""
++
++
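++def cleanup_and_sleep(out_dir):
++    '''Empty the output dir for the next run, otherwise autopkgtest complains'''
++    # running_test is assumed to be module-level state; without this
++    # declaration the assignments below would only bind a local name.
++    global running_test
++    shutil.rmtree(out_dir)
++    os.mkdir(out_dir)
++    running_test = False
++    time.sleep(300)
++    running_test = True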
++
++
  def request(msg):
      '''Callback for AMQP queue request'''
@@ -531,58 +557,50 @@ def request(msg):
              retry_start_time = time.time()
              logging.info('Running %s', ' '.join(argv))
              code = call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, start_time)
++            if code in (4, 6):
++                contents = log_contents(out_dir)
++                temp_fails = [s for s in (set(TEMPORARY_TEST_FAIL_STRINGS)
++                                          - set(getglob(PASS_PKG_STRINGS, pkgname, [])))
++                              if s in contents]
++                if temp_fails:
++                    logging.warning('Saw %s in log, which is a sign of a temporary failure.',
++                                    ' and '.join(temp_fails))
++                    logging.warning('%sLog follows:', "Retrying in 5 minutes. " if retry < 2 else "")
++                    logging.error(contents)
++                    if retry < 2:
++                        cleanup_and_sleep(out_dir)
++                else:
++                    break
++
              if code == 16 or code < 0:
++                contents = log_contents(out_dir)
                  if exit_requested is not None:
                      logging.warning('Testbed failure and exit %i requested. Log follows:', exit_requested)
--                    try:
--                        with open(os.path.join(out_dir, 'log'),
--                                  encoding='utf-8',
--                                  errors='surrogateescape') as f:
--                            logging.error(f.read())
--                    except IOError as e:
--                        logging.error('Could not read log file: %s' % str(e))
++                    logging.error(contents)
                      sys.exit(exit_requested)
--                try:
--                    with open(os.path.join(out_dir, 'log'),
--                              encoding='utf-8',
--                              errors='surrogateescape') as f:
--                        contents = f.read()
--                        # Get the package-specific string for triggers too, since they might have broken the run
--                        trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]
--                        fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]
--
--                        # Or if all-proposed, just give up and accept everything
--                        fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]
--
--                        allowed_fail_strings = set(FAIL_STRINGS +
--                                                   getglob(FAIL_PKG_STRINGS, pkgname, []) +
--                                                   fail_trigs +
--                                                   (fail_all_proposed if 'all-proposed' in params else []))
--
--                        fails = [s for s in allowed_fail_strings if s in contents] + \
--                                [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
--                        if fails:
--                            num_failures += 1
--                            logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
++                # Get the package-specific string for triggers too, since they might have broken the run
++                trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]
++                fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]
++
++                # Or if all-proposed, just give up and accept everything
++                fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]
++
++                allowed_fail_strings = set(FAIL_STRINGS +
++                                           getglob(FAIL_PKG_STRINGS, pkgname, []) +
++                                           fail_trigs +
++                                           (fail_all_proposed if 'all-proposed' in params else [])) \
++                                           - set(getglob(PASS_PKG_STRINGS, pkgname, []))
++
++                fails = [s for s in allowed_fail_strings if s in contents] + \
++                    [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
++                if fails:
++                    num_failures += 1
++                    logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
                                              ' and '.join(fails), num_failures)
--                except IOError as e:
--                    logging.error('Could not read log file: %s' % str(e))
--                logging.warning('Testbed failure, retrying in 5 minutes. Log follows:')
--                try:
--                    with open(os.path.join(out_dir, 'log'),
--                              encoding='utf-8',
--                              errors='surrogateescape') as f:
--                        logging.error(f.read())
--                except IOError as e:
--                    logging.error('Could not read log file: %s' % str(e))
--                # we need to empty the --output-dir for the next run, otherwise
--                # autopkgtest complains
++                logging.warning('Testbed failure%s. Log follows:', ", retrying in 5 minutes" if retry < 2 else "")
++                logging.error(contents)
                  if retry < 2:
--                    shutil.rmtree(out_dir)
--                    os.mkdir(out_dir)
--                    running_test = False
--                    time.sleep(300)
--                    running_test = True
++                    cleanup_and_sleep(out_dir)
              else:
                  break
          else:
@@ -591,13 +609,7 @@ def request(msg):
                  code = 4
              else:
                  logging.error('Three tmpfails in a row, aborting worker. Log follows:')
--                try:
--                    with open(os.path.join(out_dir, 'log'),
--                              encoding='utf-8',
--                              errors='surrogateescape') as f:
--                        logging.error(f.read())
--                except IOError as e:
--                    logging.error('Could not read log file: %s' % str(e))
++                logging.error(log_contents(out_dir))
                  sys.exit(99)
          duration = int(time.time() - retry_start_time)
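
The effect of subtracting PASS_PKG_STRINGS in the new `code in (4, 6)` block can be tried in isolation. The sketch below copies the two tables from the diff; `getglob` here is an assumed reimplementation (the worker's own helper is not shown in this proposal), and `temporary_fails` is a hypothetical wrapper around the list comprehension.

```python
import fnmatch

TEMPORARY_TEST_FAIL_STRINGS = ['Temporary failure resolving',
                               'Temporary failure in name resolution',
                               ' has modification time ']  # clock skew
PASS_PKG_STRINGS = {'systemd*': ['Temporary failure in name resolution']}


def getglob(table, name, default=None):
    """Assumed behaviour: value of the first glob key that matches name."""
    for pattern, value in table.items():
        if fnmatch.fnmatchcase(name, pattern):
            return value
    return default


def temporary_fails(pkgname, contents):
    """Strings in the log that should trigger a retry for this package."""
    candidates = (set(TEMPORARY_TEST_FAIL_STRINGS)
                  - set(getglob(PASS_PKG_STRINGS, pkgname, [])))
    return sorted(s for s in candidates if s in contents)


if __name__ == '__main__':
    log = 'ping: example.com: Temporary failure in name resolution'
    print(temporary_fails('systemd', log))   # string is expected for systemd
    print(temporary_fails('glibc', log))     # string triggers a retry
```

For a package matching `systemd*` the name resolution message is removed from the candidate set, so a log containing only that message does not trigger a retry.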

Merge ~rbalint/autopkgtest-cloud:more-retries into autopkgtest-cloud:master
