Merge ~pwlars/testflinger-cli:safer-polling into testflinger-cli:master

Proposed by Paul Larson
Status: Merged
Approved by: Paul Larson
Approved revision: bb3f35afe75880e368fb69631f8123daa9ba9bf7
Merged at revision: b34625416293ca9bf1291a2e8fceba19676eb63d
Proposed branch: ~pwlars/testflinger-cli:safer-polling
Merge into: testflinger-cli:master
Diff against target: 69 lines (+26/-23)
1 file modified
testflinger-cli (+26/-23)
Reviewer Review Type Date Requested Status
Paul Larson Approve
Review via email: mp+326567@code.launchpad.net

Description of the change

I found even more cases where polling can fail, this time it failed to even get the job_state at the beginning of polling - not due to a bad job_id or anything, just due to timeout. The downside of this is that if something is really stuck, or server continually times out, the jenkins job could be stuck waiting forever (or until the jenkins job timeout is reached). I'm leaning towards that being ok though, because the alternative is pretty annoying when testflinger is still running the test job, but the server timed out a response, so we crash on the jenkins side and don't see proper results from it.

We could also take the approach of "make sure the server never times out", and there may be more we could do there too, but in general, I think tools should handle failure cases gracefully.

To post a comment you must log in.
Revision history for this message
Paul Larson (pwlars) wrote :

I'd like to go ahead and land this and try it in trunk at least. If it seems to work well with our jenkins jobs (which pull from there), then I'll promote the snap to candidate/stable also.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/testflinger-cli b/testflinger-cli
2index 4625ac5..181fc06 100755
3--- a/testflinger-cli
4+++ b/testflinger-cli
5@@ -169,20 +169,10 @@ def artifacts(ctx, job_id, filename):
6 @click.pass_context
7 def poll(ctx, job_id):
8 conn = ctx.obj['conn']
9- try:
10- job_state = conn.get_status(job_id)
11- except testflinger_cli.HTTPError as e:
12- if e.status == 204:
13- print('No data found for that job id. Check the job id to be sure '
14- 'it is correct')
15- elif e.status == 400:
16- print('Invalid job id specified. Check the job id to be sure it '
17- 'is correct')
18- if e.status == 404:
19- print('Received 404 error from server. Are you sure this '
20- 'is a testflinger server?')
21- sys.exit(1)
22- while True:
23+ job_state = get_job_state(conn, job_id)
24+ while job_state != 'complete':
25+ print('sleeping, job state was {}'.format(job_state))
26+ time.sleep(10)
27 output = ''
28 try:
29 output = conn.get_output(job_id)
30@@ -194,17 +184,30 @@ def poll(ctx, job_id):
31 continue
32 if output:
33 print(output, end='', flush=True)
34- try:
35- job_state = conn.get_status(job_id)
36- except:
37- # If something breaks here, just retry so we don't affect
38- # a running test monitor that relies on poll
39- continue
40- if job_state == 'complete':
41- break
42- time.sleep(10)
43+ job_state = get_job_state(conn, job_id)
44 print(job_state)
45
46
47+def get_job_state(conn, job_id):
48+ try:
49+ return conn.get_status(job_id)
50+ except testflinger_cli.HTTPError as e:
51+ if e.status == 204:
52+ print('No data found for that job id. Check the job id to be sure '
53+ 'it is correct')
54+ elif e.status == 400:
55+ print('Invalid job id specified. Check the job id to be sure it '
56+ 'is correct')
57+ if e.status == 404:
58+ print('Received 404 error from server. Are you sure this '
59+ 'is a testflinger server?')
60+ sys.exit(1)
61+ except:
62+ # If we fail to get the job_state here, it could be because of timeout
63+ # but we can keep going and retrying
64+ pass
65+ return 'unknown'
66+
67+
68 if __name__ == '__main__':
69 cli(obj={})

Subscribers

People subscribed via source and target branches