Merge into main : thibf/uvt-kvm_wait_system_booting : lp:~thibf/uvtool : Git : Code : uvtool

Reviewer	Date Requested	Status
Robie Basak	2023-10-16	Approve on 2023-10-26
Brett Holman (community)		Approve on 2023-10-18
Review via email: mp+453689@code.launchpad.net

Thibf (thibf) wrote on 2023-10-16:

Tested by running:
```
while uvt-kvm create test release=focal arch=amd64 && uvt-kvm wait test --insecure --ssh-private-key-file ~/.ssh/id_rsa; do uvt-kvm destroy test; echo cycle done; done
```

This seems reliable. I also tried the ssh command and didn't observe any obvious regression.

Robie Basak (racb) wrote on 2023-10-17:

Thank you for working on this so promptly! Your changes look like they will work, but there are a few things that I think need cleaning up, please.

The main issue is that I think that this inappropriately overloads the ssh function. Its definition should not be to ssh but also handle its own retries matching a particular string, especially when it's only the wait mechanism that would be using it. And `capture_output` and `retries` are also inappropriately being coupled together. Consider what a description of the function arguments would look like!

Instead, please add support for capture_output to the `ssh` function, let CalledProcessError bubble up to `main_wait_remote`, and handle retry delay there. You will have direct access to args.timeout and args.interval as required.

Some further comments inline.

review: Needs Fixing
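
A minimal sketch of the separation requested above, for illustration only: the command runner executes once and lets `subprocess.CalledProcessError` propagate, so any retry policy stays with the caller. `run_once` is a hypothetical stand-in, not uvtool's actual `ssh` function, whose full signature is not shown in this thread.

```python
import subprocess
import sys


def run_once(argv, capture_output=False):
    """Hypothetical stand-in for ssh(): run one command, never retry.

    With check=True a non-zero exit raises subprocess.CalledProcessError,
    which is deliberately not caught here so that the caller can decide
    what to do with it.
    """
    return subprocess.run(
        argv,
        check=True,
        capture_output=capture_output,
        text=True,
    )


if __name__ == "__main__":
    # The caller sees both the exit status and, if requested, the output.
    try:
        run_once(
            [sys.executable, "-c", "import sys; sys.exit(3)"],
            capture_output=True,
        )
    except subprocess.CalledProcessError as e:
        print("command failed with status", e.returncode)
```

Keeping `capture_output` as a plain pass-through option means it can be described independently of any retry behaviour, which is the point made about the function's argument description.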

Robie Basak (racb) wrote on 2023-10-17:

I appreciate that some of the inline comments are moot given my requested refactoring, but I mentioned them anyway to hopefully give you a better sense of my expectations for next time :)

Thibf (thibf) wrote on 2023-10-18:

> I appreciate that some of the inline comments are moot given my requested
> refactoring, but I mentioned them anyway to hopefully give you a better sense
> of my expectations for next time :)

Makes sense!

Thibf (thibf) wrote on 2023-10-18:

> Thank you for working on this so promptly! Your changes look like they will
> work, but there are a few things that I think need cleaning up, please.
>
> The main issue is that I think that this inappropriately overloads the ssh
> function. Its definition should not be to ssh but also handle its own
> retries matching a particular string, especially when it's only the wait
> mechanism that would be using it. And `capture_output` and `retries` are also
> inappropriately being coupled together. Consider what a description of the
> function arguments would look like!
>
> Instead, please add support for capture_output to the `ssh` function, let
> CalledProcessError bubble up to `main_wait_remote`, and handle retry delay
> there. You will have direct access to args.timeout and args.interval as
> required.
>
> Some further comments inline.

Indeed, this makes sense.
See the update; I think it matches what you described.

I enabled retries only when we find the matching string, which is the best fit for the linked issue.
But it may make sense in the future to always retry on generic errors until the timeout occurs.

Brett Holman (holmanb) wrote on 2023-10-18:

I haven't tested, but the code looks good to me (pending Robie's approval).

Thanks for this fix!

review: Approve

Robie Basak (racb) wrote on 2023-10-20:

Thanks, this looks much closer to what I'd like.

> But it may make sense in the future to always retry on generic errors until the timeout occurs.

"uvt-kvm ssh some-vm false" must not retry though, to mirror the behaviour of "ssh some-ip false". In other words, wherever possible it should pass through the return status of the command it wrapped, and I'm not sure how we could differentiate that from the ssh itself failing effectively enough. See below for more on this!

> I enabled retries only when we find the matching string which is the best in the case of the linked issue.

Therefore I think this behaviour is correct.

Above I asked about how we might differentiate a failure of ssh itself. That caused me to look up what ssh defines its return value to be, and it says 255 on failure. Then I tested /etc/nologin, and found that it does treat that as an ssh failure and the ssh command exits 255 in that case. Sorry I hadn't spotted this before, but perhaps this is better than a string match?

I'm open to opinions on this. During the wait command, would retrying on 255 from ssh be better than a string match, or should we do both, or just the string match? I think I favour just the return value match, as ssh failure as opposed to command failure seems cleaner, and being restricted to main_wait_remote() it seems like this logic would fit what wait "is". But have I missed any other case that we need to consider?

One further comment inline.

review: Needs Fixing
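
To make the distinction discussed above concrete, here is a small sketch of classifying an exit status. It relies on the documented ssh behaviour that ssh exits with the status of the remote command, or with 255 if an error occurred. The names `SSH_FAILURE_STATUS` and `is_ssh_failure` are illustrative and do not come from uvtool.

```python
# Exit status that ssh uses when ssh itself fails (see the EXIT STATUS
# section of ssh(1)); any other value is the remote command's own status.
SSH_FAILURE_STATUS = 255


def is_ssh_failure(returncode):
    """Return True if the status suggests that ssh itself failed.

    Caveat: a remote command that exits 255 on its own cannot be told
    apart from an ssh failure by the exit status alone.
    """
    return returncode == SSH_FAILURE_STATUS


if __name__ == "__main__":
    for status in (0, 1, 255):
        kind = "ssh failure" if is_ssh_failure(status) else "remote command status"
        print(status, kind)
```

Under this rule, `uvt-kvm ssh some-vm false` would still pass through status 1 unchanged, while only the wait path would treat 255 as a reason to retry.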

Thibf (thibf) wrote on 2023-10-23:

> Thanks, this looks much closer to what I'd like.
>
> > But it may make sense in the future to always retry on generic errors
> until the timeout occurs.
>
> "uvt-kvm ssh some-vm false" must not retry though, to mirror the behaviour of
> "ssh some-ip false". In other words, wherever possible it should pass through
> the return status of the command it wrapped, and I'm not sure how we could
> differentiate that from the ssh itself failing effectively enough. See below
> for more on this!
>
> > I enabled retries only when we find the matching string which is the best in
> the case of the linked issue.
>
> Therefore I think this behaviour is correct.
>
> Above I asked about how we might differentiate a failure of ssh itself. That
> caused me to look up what ssh defines its return value to be, and it says 255
> on failure. Then I tested /etc/nologin, and found that it does treat that as
> an ssh failure and the ssh command exits 255 in that case. Sorry I hadn't
> spotted this before, but perhaps this is better than a string match?
>
> I'm open to opinions on this. Would retrying on 255 from ssh during the wait
> command be better to do this instead of a string match, or should we do both,
> or just the string match? I think I favour just the return value match, as ssh
> failure as opposed to command failure seems cleaner, and being restricted to
> main_wait_remote() it seems like this logic would fit what wait "is". But have
> I missed any other case that we need to consider?

I think in this case we can "just" rely on the exit code of ssh. This would make the change very generic and let the retry logic work very broadly. We can even skip checking the actual return code and retry as long as there is a failure. The timeout is still there to let the user change this behavior; 0 would disable any retries in this case.

I added an explicit error message for the timeout. Should we print the exception thrown by the sub-command? It might help further debugging but would reduce readability.

> One further comment inline.

Robie Basak (racb) wrote on 2023-10-25:

> I think in this case we can "just" rely on the exit code of ssh.

If the remote wait script fails, then we should exit immediately rather than retry. So I think this should be gated on the exit value being 255 only. Any other exit value should result in an immediate failure (eg. by passing CalledProcessError through). Sorry to ask for further changes - I thought this is what I specified above?

> I added an explicit error message for the timeout. Should we print the exception thrown by the sub-command ? Might help further debugging but reduce readability.

This looks good, thanks. I don't think it's worth going into that right now. I'll add one minor comment inline so that this might be more easily enhanced later.

review: Needs Fixing
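
The gating described above can be sketched as a loop that retries only on exit status 255 and re-raises anything else immediately. This is an illustration under stated assumptions, not the merged code: `wait_with_retry`, the injected `run`, `clock` and `sleep` parameters, and the use of `TimeoutError` are all hypothetical choices made so that the logic can be exercised without a real VM.

```python
import subprocess
import time


def wait_with_retry(run, timeout, interval,
                    clock=time.monotonic, sleep=time.sleep):
    """Call run() until it succeeds, retrying only when ssh itself failed.

    run is expected to raise subprocess.CalledProcessError on failure.
    """
    deadline = clock() + timeout
    while True:
        try:
            return run()
        except subprocess.CalledProcessError as e:
            if e.returncode != 255:
                # The remote wait script failed: pass the error through.
                raise
            if clock() >= deadline:
                raise TimeoutError("timed out waiting for ssh") from e
            sleep(interval)


if __name__ == "__main__":
    attempts = []

    def flaky():
        # Simulate ssh failing twice (status 255) before succeeding.
        attempts.append(1)
        if len(attempts) < 3:
            raise subprocess.CalledProcessError(255, ["ssh"])
        return "booted"

    print(wait_with_retry(flaky, timeout=5, interval=0.01))
```

With `timeout=0` the first 255 already hits the deadline, so no retry happens. The sketch uses `time.monotonic` so that wall-clock adjustments cannot shorten or lengthen the wait; that is a design choice of this example.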

Thibf (thibf) wrote on 2023-10-25:

> > I think in this case we can "just" rely on the exit code of ssh.
>
> If the remote wait script fails, then we should exit immediately rather than
> retry. So I think this should be gated on the exit value being 255 only. Any
> other exit value should result in an immediate failure (eg. by passing
> CalledProcessError through). Sorry to ask for further changes - I thought this
> is what I specified above?

I didn't catch this specific behavior, my bad. It's fixed. As a consequence, I added a new branch in the error handling, because we can't print "timeout occurred" when that is not the case.

> > I added an explicit error message for the timeout. Should we print the
> exception thrown by the sub-command ? Might help further debugging but reduce
> readability.
>
> This looks good, thanks. I don't think it's worth going into that right now.
> I'll add one minor comment inline so that this might be more easily enhanced
> later.

Applied

Robie Basak (racb) wrote on 2023-10-26:

Thanks!

I tested this and noticed that the nologin output appears on the terminal, which is sub-optimal, but that seems like a bit of a rabbit hole to fix and we've iterated long enough, so I'm going to merge this as-is. Especially since the cloud-init regression that exposed this (valid) uvtool bug is also planned to be fixed soon, so this will only be temporary for users of Ubuntu stable releases in practice anyway.

We could refactor this in the future to capture the message and replay it only when this condition is not detected, or something like that.

Next, this needs to be uploaded to Ubuntu and SRU'd.

review: Approve
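
One possible shape for the future refactoring mentioned above is sketched here. It captures the child's stderr and replays it unless the exit status is 255. It assumes that the nologin message arrives on stderr, which this thread does not confirm, and `run_hiding_ssh_failure_output` is a hypothetical helper name.

```python
import subprocess
import sys


def run_hiding_ssh_failure_output(argv):
    """Run argv, replaying its stderr unless it exited with status 255.

    stdout is left untouched. stderr is buffered until the child exits,
    so messages are no longer shown live; that is the trade-off here.
    """
    proc = subprocess.run(argv, stderr=subprocess.PIPE, text=True)
    if proc.returncode != 255:
        # Not an ssh-level failure: show the user what the command said.
        sys.stderr.write(proc.stderr)
    return proc.returncode


if __name__ == "__main__":
    # Simulated ssh failure: the message is swallowed, only the status is kept.
    child = "import sys; sys.stderr.write('System is booting up.\\n'); sys.exit(255)"
    print(run_hiding_ssh_failure_output([sys.executable, "-c", child]))
```

A fuller version would probably keep the last captured message and print it if the overall wait times out, so that the final failure is still explained to the user.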

diff --git a/uvtool/libvirt/kvm.py b/uvtool/libvirt/kvm.py
index 218a7f0..1bc0d63 100755
--- a/uvtool/libvirt/kvm.py
+++ b/uvtool/libvirt/kvm.py
@@ -37,6 +37,7 @@ import string
 import subprocess
 import sys
 import tempfile
+import time
 import uuid
 import yaml
@@ -963,27 +964,39 @@ def main_ssh(parser, args, default_login_name='ubuntu'):
 def main_wait_remote(parser, args):
     with open(args.remote_wait_script, 'rb') as wait_script:
-        try:
-            ssh(
-                args.name,
-                args.remote_wait_user,
-                [
-                    'env',
-                    'UVTOOL_WAIT_INTERVAL=%s' % args.interval,
-                    'UVTOOL_WAIT_TIMEOUT=%s' % args.timeout,
-                    'sh',
-                    '-'
-                ],
-                checked=True,
-                stdin=wait_script,
-                private_key_file=args.ssh_private_key_file,
-                insecure=args.insecure,
-            )
-        except InsecureError:
-            raise CLIError(
-                "ssh public host key not found. Use "
-                    "--insecure iff you trust your network path to the guest."
-            )
+        timeout = time.time() + args.timeout
+        while True:
+            try:
+                ssh(
+                    args.name,
+                    args.remote_wait_user,
+                    [
+                        'env',
+                        'UVTOOL_WAIT_INTERVAL=%s' % args.interval,
+                        'UVTOOL_WAIT_TIMEOUT=%s' % args.timeout,
+                        'sh',
+                        '-'
+                    ],
+                    checked=True,
+                    stdin=wait_script,
+                    private_key_file=args.ssh_private_key_file,
+                    insecure=args.insecure,
+                )
+                break
+            except InsecureError:
+                raise CLIError(
+                    "ssh public host key not found. Use "
+                        "--insecure iff you trust your network path to the guest."
+                )
+            except subprocess.CalledProcessError as e :
+                if e.returncode == 255:
+                    if time.time() < timeout:
+                        time.sleep(args.interval)
+                    else:
+                        raise CLIError(
+                            "timed out waiting for ssh to open on %s." % args.name) from e
+                else:
+                    raise e
 def main_wait(parser, args):
