Merge ~rbalint/autopkgtest-cloud:more-retries into autopkgtest-cloud:master

Proposed by Balint Reczey on 2020-05-27
Status: Merged
Merge reported by: Balint Reczey
Merged at revision: bc4d6865bf81ffd8f5958b5dabc503fef9289f94
Proposed branch: ~rbalint/autopkgtest-cloud:more-retries
Merge into: autopkgtest-cloud:master
Diff against target: 156 lines (+65/-53)
1 file modified
worker/worker (+65/-53)
Reviewer Review Type Date Requested Status
Iain Lane 2020-05-27 Approve on 2020-09-07
Review via email: mp+384609@code.launchpad.net
To post a comment you must log in.
Iain Lane (laney) wrote :

Thanks, but sorry, this isn't going to work as-is.

To understand why, find the code where FAIL_STRINGS_REGEX is used. It's exclusively where a run has *temporarily* failed (exitcode == 16). What we had here was a permanent failure (code 4). We need to check that the exitcode is in (2, 4, 6, 8) and then grep for a different set of strings - obviously factoring out the existing logic in the most sensible way to apply to both cases.

Also the 'Temporary failure' should not be restricted to a package - we can get this happening in runs of any test. Those other ones are restricted to systemd* and linux* because they are packages which genuinely can - and do - break booting of the instances completely when they have a bug. autopkgtest isn't great at catching that so we do it outside.

OOI, do you have any ideas why the clock is moving backwards? That feels like something we should fix properly rather than retrying on, if we can understand why it happens and why provisioning hasn't given us the right time (or if our idea of time got out of step with reality...).

review: Needs Fixing
Balint Reczey (rbalint) wrote :

@laney, thanks, I've updated the name resolution related retry and dropped the clock skew one because it is not very frequent.

My suspect is systemd-timesyncd, I've left a comment in LP: #1880839.

Iain Lane (laney) wrote :

Still needs fixing to run this code in the case of exit code 2, 4, 6, 8 - second paragraph above.

See

https://git.launchpad.net/autopkgtest-cloud/tree/worker/worker#n534

this is the block that needs updating.

Balint Reczey (rbalint) wrote :

Thanks, indeed.
I think retry should not run for 2 and 8, and I'm not sure about 12, 14 and 20, so I've added 4 and 6 as safe bets.

Iain Lane (laney) wrote :

Sorry Balint, this still needs fixing. :(

I'm not being clear, let me try some more:

There are two types of failure.

1. Permanent failures, which cause a test to be marked as a fail in the database and frontends (e.g. proposed-migration or the autopkgtest.ubuntu.com website) display this as a failure.
2. Temporary failures, where autopkgtest thinks that *it* caused the failure, or it's otherwise transient. These are reported by exit code 16 and autopkgtest-cloud queues these to be re-run.
  2.5. Sometimes permanent failures are misdetected as temporary ones, so we have this code to convert the two. In that case we override the code 16 to a code 4 ("at least one test failed").

What you want to introduce is a 1.5 that's the kind of opposite of 2.5. We want to convert some kinds of permanent failure into temporary ones, so that they get retried. We do *not* want to generally start retrying all permanent failures, which is what the MP currently would do.

I think that *above* the "if code == 16 ..." line, you should add a check for "if code in (2, 4, 6, 8):", which:

  - Greps the log for one of the *new* (different variable) strings that we want to retry on
  - If found, return as if it were a temporary failure, so that we retry. Do this up to three times.

Of course that code will be *common* with 2.5, so you can probably move some of that into functions and call it in both places.

review: Needs Fixing
adb8dc4... by Balint Reczey on 2020-08-15

worker: Factor out reading the log file

7a27868... by Balint Reczey on 2020-08-15

worker: Retry tests on possibly temporary issues

LP: #1829752

77cf6b3... by Balint Reczey on 2020-08-16

worker: Don't log that retry will take place when it won't

Balint Reczey (rbalint) wrote :

@laney Could you please take a look again?

bc4d686... by Balint Reczey on 2020-08-29

Workaround clock skew on testbeds

LP: #1880839

Iain Lane (laney) wrote :

Alright, I think this looks good, let's try it. Thanks for sticking with it.

review: Approve
Balint Reczey (rbalint) wrote :

@laney Could you please merge it? I can't.

Balint Reczey (rbalint) wrote :

It is actually merged, sorry.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/worker/worker b/worker/worker
2index c83be3c..0a77409 100755
3--- a/worker/worker
4+++ b/worker/worker
5@@ -54,6 +54,12 @@ FAIL_STRINGS = ['Kernel panic - not syncing:',
6 'Out of memory: Kill process',
7 'error: you need to load the kernel first.',
8 'Faulting instruction address:']
9+TEMPORARY_TEST_FAIL_STRINGS = ['Temporary failure resolving',
10+ 'Temporary failure in name resolution',
11+ ' has modification time '] # clock skew, LP: #1880839
12+
13+# Some strings occur in passing tests of specific packages
14+PASS_PKG_STRINGS = {'systemd*': ['Temporary failure in name resolution']}
15
16 # If we repeatedly time out when installing, there's probably a problem with
17 # one of the packages' maintainer scripts.
18@@ -258,6 +264,26 @@ def call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, star
19 return ret
20
21
22+def log_contents(out_dir):
23+ try:
24+ with open(os.path.join(out_dir, 'log'),
25+ encoding='utf-8',
26+ errors='surrogateescape') as f:
27+ return f.read()
28+ except IOError as e:
29+ logging.error('Could not read log file: %s' % str(e))
30+ return ""
31+
32+
33+def cleanup_and_sleep(out_dir):
34+ '''Empty the output dir for the next run, otherwise autopkgtest complains'''
35+ shutil.rmtree(out_dir)
36+ os.mkdir(out_dir)
37+ running_test = False
38+ time.sleep(300)
39+ running_test = True
40+
41+
42 def request(msg):
43 '''Callback for AMQP queue request'''
44
45@@ -531,58 +557,50 @@ def request(msg):
46 retry_start_time = time.time()
47 logging.info('Running %s', ' '.join(argv))
48 code = call_autopkgtest(argv, release, architecture, pkgname, params, out_dir, start_time)
49+ if code in (4, 6):
50+ contents = log_contents(out_dir)
51+ temp_fails = [s for s in (set(TEMPORARY_TEST_FAIL_STRINGS)
52+ - set(getglob(PASS_PKG_STRINGS, pkgname, [])))
53+ if s in contents]
54+ if temp_fails:
55+ logging.warning('Saw %s in log, which is a sign of a temporary failure.',
56+ ' and '.join(fails))
57+ logging.warning('%sLog follows:', "Retrying in 5 minutes. " if retry < 2 else "")
58+ logging.error(contents)
59+ if retry < 2:
60+ cleanup_and_sleep(out_dir)
61+ else:
62+ break
63+
64 if code == 16 or code < 0:
65+ contents = log_contents(out_dir)
66 if exit_requested is not None:
67 logging.warning('Testbed failure and exit %i requested. Log follows:', exit_requested)
68- try:
69- with open(os.path.join(out_dir, 'log'),
70- encoding='utf-8',
71- errors='surrogateescape') as f:
72- logging.error(f.read())
73- except IOError as e:
74- logging.error('Could not read log file: %s' % str(e))
75+ logging.error(contents)
76 sys.exit(exit_requested)
77- try:
78- with open(os.path.join(out_dir, 'log'),
79- encoding='utf-8',
80- errors='surrogateescape') as f:
81- contents = f.read()
82- # Get the package-specific string for triggers too, since they might have broken the run
83- trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]
84- fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]
85-
86- # Or if all-proposed, just give up and accept everything
87- fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]
88-
89- allowed_fail_strings = set(FAIL_STRINGS +
90- getglob(FAIL_PKG_STRINGS, pkgname, []) +
91- fail_trigs +
92- (fail_all_proposed if 'all-proposed' in params else []))
93-
94- fails = [s for s in allowed_fail_strings if s in contents] + \
95- [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
96- if fails:
97- num_failures += 1
98- logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
99+ # Get the package-specific string for triggers too, since they might have broken the run
100+ trigs = [t.split('/', 1)[0] for t in params.get('triggers', [])]
101+ fail_trigs = [j for i in [getglob(FAIL_PKG_STRINGS, trig, []) for trig in trigs] for j in i]
102+
103+ # Or if all-proposed, just give up and accept everything
104+ fail_all_proposed = [j for i in FAIL_PKG_STRINGS.values() for j in i]
105+
106+ allowed_fail_strings = set(FAIL_STRINGS +
107+ getglob(FAIL_PKG_STRINGS, pkgname, []) +
108+ fail_trigs +
109+ (fail_all_proposed if 'all-proposed' in params else [])) \
110+ - set(getglob(PASS_PKG_STRINGS, pkgname, []))
111+
112+ fails = [s for s in allowed_fail_strings if s in contents] + \
113+ [s for s in FAIL_STRINGS_REGEX if re.search(s, contents)]
114+ if fails:
115+ num_failures += 1
116+ logging.warning('Saw %s in log, which is a sign of a real (not tmp) failure - seen %d so far',
117 ' and '.join(fails), num_failures)
118- except IOError as e:
119- logging.error('Could not read log file: %s' % str(e))
120- logging.warning('Testbed failure, retrying in 5 minutes. Log follows:')
121- try:
122- with open(os.path.join(out_dir, 'log'),
123- encoding='utf-8',
124- errors='surrogateescape') as f:
125- logging.error(f.read())
126- except IOError as e:
127- logging.error('Could not read log file: %s' % str(e))
128- # we need to empty the --output-dir for the next run, otherwise
129- # autopkgtest complains
130+ logging.warning('Testbed failure%s. Log follows:', ", retrying in 5 minutes" if retry < 2 else "")
131+ logging.error(contents)
132 if retry < 2:
133- shutil.rmtree(out_dir)
134- os.mkdir(out_dir)
135- running_test = False
136- time.sleep(300)
137- running_test = True
138+ cleanup_and_sleep(out_dir)
139 else:
140 break
141 else:
142@@ -591,13 +609,7 @@ def request(msg):
143 code = 4
144 else:
145 logging.error('Three tmpfails in a row, aborting worker. Log follows:')
146- try:
147- with open(os.path.join(out_dir, 'log'),
148- encoding='utf-8',
149- errors='surrogateescape') as f:
150- logging.error(f.read())
151- except IOError as e:
152- logging.error('Could not read log file: %s' % str(e))
153+ logging.error(log_contents(out_dir))
154 sys.exit(99)
155
156 duration = int(time.time() - retry_start_time)

Subscribers

People subscribed via source and target branches