Comment 27 for bug 1835205

Ryan Harper (raharper) wrote :

@Pengpeng,

> Yes, I have verified the fallback path worked after PR #229 applied. so it
> should fix this issue.

OK. That's good news. I think for local datasources, the fallback option
works quite well.

>
> I do NOT get "...DataSourceOVF does not persist an obj.pkl". From what I
> saw, after loaded DataSourceOVF, obj.pkl file was created under instance
> path. Doesn't the fallback ds load from obj.pkl?

Sorry. I should have been more clear. While the obj.pkl is written for
local datasources, it's only used in transitioning between local and net
stages:

For all datasources, the cloud-init boot sequence does the following:

1) cloud-init init --local will purge the object cache unless it's told
   not to (manual cache clean); this removes the finished boot flag
   and sets the 'existing' variable to 'check'. This is important for
   the next stage of cloud-init (cloud-init init).

   cloudinit/cmd/main.py:main_init lines 305 -> 317

   Now, cloud-init init --local will fetch the datasource with
   existing='check'

   cloudinit/stages.py:fetch() lines 349
     cloudinit/stages.py:_get_data_sources(existing='check') lines 236
       cloudinit/stages.py:_restore_from_checked_cache(existing='check') lines 211
         cloudinit/stages.py:_restore_from_cache() lines 184
           cloudinit/stages.py:_pkl_load() lines 943

   In reboot scenarios, the datasource is reloaded and then cloud-init
   compares the instance-id from /var/lib/cloud/data/instance-id to the
   ds.get_instance_id() value; if they match, then it's restored from cache.

   If they do not match, then the datasource from cache (obj.pkl) is
   ignored. Then cloud-init will walk the local datasources calling
   ._get_data() on each until a local datasource is found.

   Once a datasource is found (one of the datasources' _get_data() methods
   returns True), self.datasource is set and we do persist an obj.pkl.

   cloud-init init --local exits.

2) cloud-init init runs after networking is up; this time in
   main.py:main_init the ds mode is NET, which sets existing='trust'. The
   idea here is that cloud-init local mode detects and finds the correct
   datasource, so at NET stage we don't need to look for a datasource
   again if one was found.

   *Network only* datasources (Ec2 for example), ones that can only be
   detected by checking a network endpoint (for example
   http://169.254.169.254), will not be checked until networking is up; so
   in (1) we don't bother checking them, and detection waits until this
   stage.

   When cloud-init attempts _restore_from_checked_cache() with
   existing='trust', the _pkl_load() succeeds.

   Next, cloud-init looks up the on-disk instance-id:
      /var/lib/cloud/data/instance-id
   and then calls ds.get_instance_id(); if they match, then we're on
   the same instance as we were before. For local datasources, this is
   trivially true: when we found the datasource in (1) we persisted the
   object, and we've just loaded it.
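
To make the cache check in (1) and (2) concrete, here is a rough sketch of
the comparison described above. It is a toy model, not the code in
cloudinit/stages.py: the function names (load_pickled, read_instance_id,
restore_checked, demo) are made up for illustration, a plain dict stands in
for the pickled datasource object, and the two 'existing' modes are treated
identically because only the instance-id comparison is modeled.

```python
import os
import pickle
import tempfile


def load_pickled(pkl_path):
    """Return the cached stand-in datasource, or None if it cannot be read."""
    try:
        with open(pkl_path, "rb") as fh:
            return pickle.load(fh)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None


def read_instance_id(iid_path):
    """Return the instance-id recorded on disk, or None if it is missing."""
    try:
        with open(iid_path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def restore_checked(pkl_path, iid_path, existing):
    """Toy model of the cache check: reuse obj.pkl only if the ids agree."""
    if existing not in ("check", "trust"):
        raise ValueError("unexpected value for existing: %s" % existing)
    ds = load_pickled(pkl_path)
    if ds is None:
        return None, "no cache found"
    # Hypothetical: the dict key stands in for ds.get_instance_id().
    if read_instance_id(iid_path) == ds["instance_id"]:
        return ds, "restored from cache"
    return None, "instance-id mismatch, cache ignored"


def demo():
    """Write a toy cache, then check it against a matching and a changed id."""
    with tempfile.TemporaryDirectory() as tmp:
        pkl_path = os.path.join(tmp, "obj.pkl")
        iid_path = os.path.join(tmp, "instance-id")
        with open(pkl_path, "wb") as fh:
            pickle.dump({"name": "OVF", "instance_id": "iid-1"}, fh)
        with open(iid_path, "w") as fh:
            fh.write("iid-1\n")
        same = restore_checked(pkl_path, iid_path, "check")
        with open(iid_path, "w") as fh:
            fh.write("iid-2\n")
        changed = restore_checked(pkl_path, iid_path, "check")
    return same, changed
```

Calling demo() returns one result for the matching instance-id (the cached
dict is reused) and one for the changed instance-id (the cache is ignored,
which is where the walk over local datasources would start).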

For local-only datasources (NoCloud, OVF), I think #229 works, specifically
for OVF since its detection is based on files that are removed after first
boot. I'd like to add a unit test to exercise this path; specifically we'd
want:

  a) /var/lib/cloud/* populated as it would look after a first boot
  b) run main_init with args.local=True
  c) mock datasource._get_data to return False (forcing down the fallback
     path)
  d) we should verify that self.datasource matches what's in obj.pkl

We should also test paths around OVF instance_id changing/resetting. #229
operates under the assumption that the instance_id should not change between
reboots. A second unit test should ensure that if the OVF instance_id
changes, the fallback path does NOT successfully load OVF.
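
As a starting point, the two tests could be shaped roughly like the sketch
below. Everything in it is a stand-in and not cloud-init code: ToyOVF and
find_datasource are hypothetical, a dict is pickled in place of the real
datasource object, "instance" is a plain directory under a temp dir, and the
sketch assumes the fallback rejects obj.pkl when its instance-id differs from
the one recorded on disk. A real test would drive main_init and the stages
code directly.

```python
import os
import pickle
import tempfile
import unittest
from unittest import mock


class ToyOVF:
    """Hypothetical stand-in for a local datasource."""
    name = "OVF"

    def _get_data(self):
        return True  # pretend the detection files are still present


def find_datasource(cloud_dir, sources):
    """Toy model: try live detection first, then fall back to obj.pkl."""
    for ds in sources:
        if ds._get_data():
            return ds.name
    try:
        with open(os.path.join(cloud_dir, "instance", "obj.pkl"), "rb") as fh:
            cached = pickle.load(fh)
        with open(os.path.join(cloud_dir, "data", "instance-id")) as fh:
            disk_iid = fh.read().strip()
    except (OSError, EOFError, pickle.UnpicklingError):
        return None
    # Assumed rule: only trust the cache if the instance-id is unchanged.
    return cached["name"] if cached["instance_id"] == disk_iid else None


class FallbackTest(unittest.TestCase):
    def setUp(self):
        # (a) populate a fake cloud dir as it might look after first boot
        tmp = tempfile.TemporaryDirectory()
        self.addCleanup(tmp.cleanup)
        self.cloud_dir = tmp.name
        os.makedirs(os.path.join(self.cloud_dir, "instance"))
        os.makedirs(os.path.join(self.cloud_dir, "data"))
        self.write_state(cached_iid="iid-ovf-1", disk_iid="iid-ovf-1")

    def write_state(self, cached_iid, disk_iid):
        with open(os.path.join(self.cloud_dir, "instance", "obj.pkl"), "wb") as fh:
            pickle.dump({"name": "OVF", "instance_id": cached_iid}, fh)
        with open(os.path.join(self.cloud_dir, "data", "instance-id"), "w") as fh:
            fh.write(disk_iid + "\n")

    def test_fallback_restores_cached_datasource(self):
        # (c) force detection to fail, (d) expect what obj.pkl holds
        with mock.patch.object(ToyOVF, "_get_data", return_value=False):
            self.assertEqual(find_datasource(self.cloud_dir, [ToyOVF()]), "OVF")

    def test_fallback_rejected_when_instance_id_changes(self):
        self.write_state(cached_iid="iid-ovf-1", disk_iid="iid-ovf-2")
        with mock.patch.object(ToyOVF, "_get_data", return_value=False):
            self.assertIsNone(find_datasource(self.cloud_dir, [ToyOVF()]))
```

Saved to a file, this runs with python3 -m unittest; the point is the shape
of the two cases, not the toy lookup itself.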

For Ec2, which is detected locally (by checking system UUID string), I don't
think we ever use the fallback path due to the local detection; this means
we always re-use the on-disk obj.pkl. I don't think that's a blocker to
merging.