Merge ~thibf/uvtool:thibf/uvt-kvm_wait_system_booting into uvtool:main

Proposed by Thibf
Status: Merged
Merged at revision: 35dc66e269abb713472701980cbec8b09ef4677c
Proposed branch: ~thibf/uvtool:thibf/uvt-kvm_wait_system_booting
Merge into: uvtool:main
Diff against target: 73 lines (+34/-21)
1 file modified
uvtool/libvirt/kvm.py (+34/-21)
Reviewer Review Type Date Requested Status
Robie Basak Approve
Brett Holman (community) Approve
Review via email: mp+453689@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Thibf (thibf) wrote :

Tried by doing:
```
while uvt-kvm create test release=focal arch=amd64 && uvt-kvm wait bjf-test --insecure --ssh-private-key-file ~/.ssh/id_rsa; do uvt-kvm destroy test; echo cycle done; done
```

Seems reliable. Tried also ssh command, didn't observe any obvious regression.

Revision history for this message
Robie Basak (racb) wrote :

Thank you for working on this so promptly! Your changes look like they will work, but there are a few things that I think need cleanup up please.

The main issue is that I think that this inappropriately overloads the ssh function. Its definition should not be to ssh but also and also handle its own retries matching a particular string, especially when it's only the wait mechanism that would be using it. And `capture_output` and `retries` are also inappropriately being coupled together. Consider what a description of the function arguments would look like!

Instead, please add support for capture_output to the `ssh` function, let CalledProcessError bubble up to `main_wait_remote`, and handle retry delay there. You will have direct access to args.timeout and args.interval as required.

Some further comments inline.

review: Needs Fixing
Revision history for this message
Robie Basak (racb) wrote :

I appreciate that some of the inline comments are moot given my requested refactoring, but I mentioned them anyway to hopefully give you a better sense of my expectations for next time :)

Revision history for this message
Thibf (thibf) wrote :

> I appreciate that some of the inline comments are moot given my requested
> refactoring, but I mentioned them anyway to hopefully give you a better sense
> of my expectations for next time :)

Make sense!

Revision history for this message
Thibf (thibf) wrote :

> Thank you for working on this so promptly! Your changes look like they will
> work, but there are a few things that I think need cleanup up please.
>
> The main issue is that I think that this inappropriately overloads the ssh
> function. Its definition should not be to ssh but also and also handle its own
> retries matching a particular string, especially when it's only the wait
> mechanism that would be using it. And `capture_output` and `retries` are also
> inappropriately being coupled together. Consider what a description of the
> function arguments would look like!
>
> Instead, please add support for capture_output to the `ssh` function, let
> CalledProcessError bubble up to `main_wait_remote`, and handle retry delay
> there. You will have direct access to args.timeout and args.interval as
> required.
>
> Some further comments inline.

Indeed, this makes sense.
See the update, I think it match what you shared.

I enabled retries only when we find the matching string which is the best in the case of the linked issue.
But it may make sense in the future to always retries for generic errors until timeout occur.

Revision history for this message
Brett Holman (holmanb) wrote :

I haven't tested, but the code looks good to me (pending Robbie's approval).

Thanks for this fix!

review: Approve
Revision history for this message
Robie Basak (racb) wrote :

Thanks, this looks much closer to what I'd like.

> But it may make sense in the future to always retries for generic errors until timeout occur.

"uvt-kvm ssh some-vm false" must not retry though, to mirror the behaviour of "ssh some-ip false". In other words, wherever possible it should pass through the return status of the command it wrapped, and I'm not sure how we could differentiate that from the ssh itself failing effectively enough. See below for more on this!

> I enabled retries only when we find the matching string which is the best in the case of the linked issue.

Therefore I think this behaviour is correct.

Above I asked about how we might differentiate a failure of ssh itself. That caused me to look up what ssh defines its return value to be, and it says 255 on failure. Then I tested /etc/nologin, and found that it does treat that as an ssh failure and the ssh command exits 255 in that case. Sorry I hadn't spotted this before, but perhaps this is better than a string match?

I'm open to opinions on this. Would retrying on 255 from ssh during the wait command be better to do this instead of a string match, or should we do both, or just the string match? I think I favour just the return value match, as ssh failure as opposed to command failure seems cleaner, and being restricted to main_wait_remote() it seems like this logic would fit what wait "is". But have I missed any other case that we need to consider?

One further comment inline.

review: Needs Fixing
Revision history for this message
Thibf (thibf) wrote :

> Thanks, this looks much closer to what I'd like.
>
> > But it may make sense in the future to always retries for generic errors
> until timeout occur.
>
> "uvt-kvm ssh some-vm false" must not retry though, to mirror the behaviour of
> "ssh some-ip false". In other words, wherever possible it should pass through
> the return status of the command it wrapped, and I'm not sure how we could
> differentiate that from the ssh itself failing effectively enough. See below
> for more on this!
>
> > I enabled retries only when we find the matching string which is the best in
> the case of the linked issue.
>
> Therefore I think this behaviour is correct.
>
> Above I asked about how we might differentiate a failure of ssh itself. That
> caused me to look up what ssh defines its return value to be, and it says 255
> on failure. Then I tested /etc/nologin, and found that it does treat that as
> an ssh failure and the ssh command exits 255 in that case. Sorry I hadn't
> spotted this before, but perhaps this is better than a string match?
>
> I'm open to opinions on this. Would retrying on 255 from ssh during the wait
> command be better to do this instead of a string match, or should we do both,
> or just the string match? I think I favour just the return value match, as ssh
> failure as opposed to command failure seems cleaner, and being restricted to
> main_wait_remote() it seems like this logic would fit what wait "is". But have
> I missed any other case that we need to consider?

I think in this case we can "just" rely on the exit code of ssh. This would make this modification very generic and the retry logic work very broadly. We can even skip checking the actual return code and retry as long there is a failure. The timeout is still there to let the user change this behavior, 0 would disable any retries in this case.

I added an explicit error message for the timeout. Should we print the exception thrown by the sub-command ? Might help further debugging but reduce readability.
> One further comment inline.

Revision history for this message
Robie Basak (racb) wrote :

> I think in this case we can "just" rely on the exit code of ssh.

If the remote wait script fails, then we should exit immediately rather than retry. So I think this should be gated on the exit value being 255 only. Any other exit value should result in an immediate failure (eg. by passing CalledProcessError through). Sorry to ask for further changes - I thought this is what I specified above?

> I added an explicit error message for the timeout. Should we print the exception thrown by the sub-command ? Might help further debugging but reduce readability.

This looks good, thanks. I don't think it's worth going into that right now. I'll add one minor comment inline so that this might be more easily enhanced later.

review: Needs Fixing
Revision history for this message
Thibf (thibf) wrote :

> > I think in this case we can "just" rely on the exit code of ssh.
>
> If the remote wait script fails, then we should exit immediately rather than
> retry. So I think this should be gated on the exit value being 255 only. Any
> other exit value should result in an immediate failure (eg. by passing
> CalledProcessError through). Sorry to ask for further changes - I thought this
> is what I specified above?

I didn't get this specific behavior, my bad on this. It's fixed, consequently added a new branch in error handling because we can't print "timeout occured" while it's not the case.
>
> > I added an explicit error message for the timeout. Should we print the
> exception thrown by the sub-command ? Might help further debugging but reduce
> readability.
>
> This looks good, thanks. I don't think it's worth going into that right now.
> I'll add one minor comment inline so that this might be more easily enhanced
> later.

Applied

Revision history for this message
Robie Basak (racb) wrote :

Thanks!

I tested this and noticed that the nologin output appears on the terminal, which is sub-optimal, but that seems like a bit of a rabbithole to fix and we've iterated long enough, so I'm going to merge this as-is. Especially since the cloud-init regression that exposed this (valid) uvtool bug is also planned to be fixed soon, so this will only be temporary for users of Ubuntu stable releases in practice anyway.

We could refactor this in the future to capture the message and replay it unless detecting this condition or something like that.

Next, this needs to be uploaded to Ubuntu and SRU'd.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/uvtool/libvirt/kvm.py b/uvtool/libvirt/kvm.py
2index 218a7f0..1bc0d63 100755
3--- a/uvtool/libvirt/kvm.py
4+++ b/uvtool/libvirt/kvm.py
5@@ -37,6 +37,7 @@ import string
6 import subprocess
7 import sys
8 import tempfile
9+import time
10 import uuid
11 import yaml
12
13@@ -963,27 +964,39 @@ def main_ssh(parser, args, default_login_name='ubuntu'):
14
15 def main_wait_remote(parser, args):
16 with open(args.remote_wait_script, 'rb') as wait_script:
17- try:
18- ssh(
19- args.name,
20- args.remote_wait_user,
21- [
22- 'env',
23- 'UVTOOL_WAIT_INTERVAL=%s' % args.interval,
24- 'UVTOOL_WAIT_TIMEOUT=%s' % args.timeout,
25- 'sh',
26- '-'
27- ],
28- checked=True,
29- stdin=wait_script,
30- private_key_file=args.ssh_private_key_file,
31- insecure=args.insecure,
32- )
33- except InsecureError:
34- raise CLIError(
35- "ssh public host key not found. Use "
36- "--insecure iff you trust your network path to the guest."
37- )
38+ timeout = time.time() + args.timeout
39+ while True:
40+ try:
41+ ssh(
42+ args.name,
43+ args.remote_wait_user,
44+ [
45+ 'env',
46+ 'UVTOOL_WAIT_INTERVAL=%s' % args.interval,
47+ 'UVTOOL_WAIT_TIMEOUT=%s' % args.timeout,
48+ 'sh',
49+ '-'
50+ ],
51+ checked=True,
52+ stdin=wait_script,
53+ private_key_file=args.ssh_private_key_file,
54+ insecure=args.insecure,
55+ )
56+ break
57+ except InsecureError:
58+ raise CLIError(
59+ "ssh public host key not found. Use "
60+ "--insecure iff you trust your network path to the guest."
61+ )
62+ except subprocess.CalledProcessError as e :
63+ if e.returncode == 255:
64+ if time.time() < timeout:
65+ time.sleep(args.interval)
66+ else:
67+ raise CLIError(
68+ "timed out waiting for ssh to open on %s." % args.name) from e
69+ else:
70+ raise e
71
72
73 def main_wait(parser, args):

Subscribers

People subscribed via source and target branches