OVF datasource should check if instance id is still on VMware Platform

Bug #1835205 reported by Pengpeng Sun
This bug affects 10 people
Affects    | Status  | Importance | Assigned to | Milestone
cloud-init | Expired | Medium     | Unassigned  |

Bug Description

Currently DataSourceOVF does not check the instance id to determine whether the current instance is new, so on every boot cloud-init goes through the entire datasource list.

This leads to an issue:
When a VM's network is customized to a static IP by cloud-init's DataSourceOVF and the VM is then rebooted, the network is switched to DHCP after the reboot because no customization config file is found this time. cloud-init then uses DataSourceNone, which changes the network to the "default" configuration.

The expected behaviors are:
1. cloud-init checks the instance id to see whether this instance is "iid-vmware-xxxxxx", which means it's a VMware VM. If so, it should always use DataSourceOVF.
2. When there is a customization config file, DataSourceOVF parses the configuration and enables the NICs as usual.
3. When there is no customization config file, DataSourceOVF waits for it until the timeout expires, and cloud-init then does NOT apply other datasources.

Revision history for this message
Ryan Harper (raharper) wrote :

Thanks for filing a bug.

Cloud-init does check instance-id by looking in /var/lib/cloud/data/ at files 'instance-id' and 'previous-instance-id' and will check if the existing Datasource's metadata provides the same instance-id.
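
For readers following along, the comparison described above can be sketched roughly as follows. This is a minimal illustration, not cloud-init's actual code: the helper names `read_iid` and `is_new_instance` are hypothetical, and only the `instance-id` file name and the `/var/lib/cloud/data/` directory come from the comment.

```python
from pathlib import Path


def read_iid(path: Path):
    """Return the stripped contents of an instance-id file, or None if absent."""
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None


def is_new_instance(data_dir: Path, current_iid: str) -> bool:
    """Compare the id reported by the datasource with the one stored on disk.

    Hypothetical helper: a missing file or a different id both count as
    'new instance', which is the situation reported in this bug.
    """
    stored = read_iid(data_dir / "instance-id")
    return stored != current_iid
```

Called as `is_new_instance(Path("/var/lib/cloud/data"), "iid-datasource-none")` on a machine whose stored id is `iid-vmware-...`, this would report a new instance.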

The behavior you describe *sounds* like the OVF datasource changed.

Please run cloud-init collect-logs as root and attach the tarball so we can see what's going on.

Thanks!

Changed in cloud-init:
importance: Undecided → Medium
status: New → Incomplete
Revision history for this message
Pengpeng Sun (pengpengs) wrote :

Hi Ryan,

After cloud-init customization completed, I saw that /var/lib/cloud/instance links to /var/lib/cloud/instances/iid-vmware-XXXXXXXX. But this instance does not seem to take effect at the next reboot; cloud-init always thinks it's a new instance. After a reboot, /var/lib/cloud/instance links to /var/lib/cloud/instances/iid-datasource-none.

I think `check_instance_id` is not implemented in the OVF datasource.

Thanks,
Pengpeng

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

Attach logs after cloud-init customization by OVF datasource.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

Attach logs after reboot.

Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Ilia (ilia-zenkevich) wrote :

Hi all.
I'm facing the same issue.

Environment:
vCloud Director 9.1
Ubuntu Server 16.04.1 x64
open-vm-tools 2:10.2.0-3~ubuntu0.16.04.1
cloud-init 19.1-1-gbaa47854-0ubuntu1~16.04.1

The cloud-init log for the second and later boots always contains:

2019-07-23 13:22:39,347 - DataSourceOVF.py[DEBUG]: Did not find VMware Customization Config File
2019-07-23 13:22:39,347 - util.py[DEBUG]: Running command ['vmware-rpctool', 'info-get guestinfo.ovfEnv'] with allowed return codes [0]
 (shell=False, capture=True)
2019-07-23 13:22:39,353 - handlers.py[DEBUG]: finish: init-local/search-OVF: FAIL: no local data found from DataSourceOVF
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOVF.py", line 394, in read_ovf_environment
    props = get_properties(contents)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceOVF.py", line 553, in get_properties
    raise XmlError("No 'PropertySection's")
cloudinit.sources.DataSourceOVF.XmlError: No 'PropertySection's

If I run "vmware-rpctool info-get guestinfo.ovfEnv" manually, I don't see any "PropertySection" sections, but the question is: does this section have to be in the response?
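
To make the traceback easier to read, here is a small sketch of how a parser can end up raising that error. It is an assumption-laden illustration, not the code in DataSourceOVF.py: it uses the standard OVF environment namespace, raises a plain `ValueError` instead of `XmlError`, and does not settle whether the platform is required to send a PropertySection.

```python
import xml.etree.ElementTree as ET

# Namespace of the DMTF OVF environment document.
OVF_ENV_NS = "http://schemas.dmtf.org/ovf/environment/1"


def get_properties(contents: str) -> dict:
    """Return {key: value} from every PropertySection in an OVF environment.

    Sketch only: raises ValueError when no PropertySection is present,
    mirroring the message seen in the log above.
    """
    root = ET.fromstring(contents)
    sections = root.findall(f"{{{OVF_ENV_NS}}}PropertySection")
    if not sections:
        raise ValueError("No 'PropertySection's")
    props = {}
    for section in sections:
        for prop in section.findall(f"{{{OVF_ENV_NS}}}Property"):
            key = prop.get(f"{{{OVF_ENV_NS}}}key")
            props[key] = prop.get(f"{{{OVF_ENV_NS}}}value")
    return props
```

An environment document that only carries platform or network sections would take the `ValueError` branch, which matches the behaviour reported here.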

Revision history for this message
Ryan Harper (raharper) wrote :

Note that check_instance_id() is only needed if you want to avoid fetching metadata/userdata a second time; it does not prevent cloud-init from knowing it's on the same instance. In all cases, the metadata that indicates the instance-id _must_ remain associated with the instance.

Looking at the logs; it appears that after reboot the Customization file is not found and OVF datasource _get_data() method returns False; which tells cloud-init that it's not using OVF as a datasource.

1. firstboot finds OVF Customization

2019-07-24 07:50:43,682 - util.py[DEBUG]: Cloud-init v. 19.1-1-gbaa47854-0ubuntu1~18.04.1 running 'init-local' at Wed, 24 Jul 2019 07:50:43 +0000. Up 12.30 seconds.
...
2019-07-24 07:50:43,802 - __init__.py[DEBUG]: Searching for local data source in: ['DataSourceOVF']
2019-07-24 07:50:43,802 - handlers.py[DEBUG]: start: init-local/search-OVF: searching for local data from DataSourceOVF
2019-07-24 07:50:43,802 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceOVF.DataSourceOVF'>
2019-07-24 07:50:43,802 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: New instance first boot

2019-07-24 07:50:43,813 - DataSourceOVF.py[DEBUG]: VMware Virtualization Platform found
2019-07-24 07:50:43,813 - DataSourceOVF.py[DEBUG]: Found the customization plugin at /usr/lib/open-vm-tools/plugins/vmsvc/libdeployPkgPlugin.so
2019-07-24 07:50:43,813 - DataSourceOVF.py[DEBUG]: Waiting for VMware Customization Config File
2019-07-24 07:50:48,818 - util.py[DEBUG]: waiting for configuration file took 5.005 seconds
2019-07-24 07:50:48,823 - DataSourceOVF.py[DEBUG]: Found VMware Customization Config File at /var/run/vmware-imc/cust.cfg
2019-07-24 07:50:49,928 - handlers.py[DEBUG]: finish: init-local/search-OVF: SUCCESS: found local data from DataSourceOVF
2019-07-24 07:50:49,929 - stages.py[INFO]: Loaded datasource DataSourceOVF - DataSourceOVF [seed=vmware-tools]

This sets the instance id to:

2019-07-24 07:50:49,945 - util.py[DEBUG]: Creating symbolic link from '/var/lib/cloud/instance' => '/var/lib/cloud/instances/iid-vmware-QokNwWPI'

2. Second boot of same instance does not find metadata from DataSourceOVF

2019-07-24 07:56:49,746 - DataSourceOVF.py[DEBUG]: VMware Virtualization Platform found
2019-07-24 07:56:49,746 - DataSourceOVF.py[DEBUG]: Found the customization plugin at /usr/lib/open-vm-tools/plugins/vmsvc/libdeployPkgPlugin.so
2019-07-24 07:56:49,746 - DataSourceOVF.py[DEBUG]: Waiting for VMware Customization Config File

2019-07-24 07:58:19,856 - util.py[DEBUG]: waiting for configuration file took 90.110 seconds
2019-07-24 07:58:19,858 - DataSourceOVF.py[DEBUG]: Did not find VMware Customization Config File

2019-07-24 07:58:20,152 - main.py[DEBUG]: [local] Exiting without datasource

Why did the customization data for the instance go away?
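
The two boots above differ only in whether the config file appears before the wait expires. A minimal sketch of that wait-then-give-up behaviour, assuming a simple polling loop (the function names are hypothetical and the real datasource code may be structured differently):

```python
import os
import time


def wait_for_file(path: str, timeout: float, interval: float = 0.1) -> bool:
    """Poll for `path` until it exists or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def get_data(cfg_path: str, timeout: float) -> bool:
    """Hypothetical datasource probe: succeeds only if the file shows up.

    Returning False is what tells the caller that this datasource is not
    in use, as on the second boot in the log above.
    """
    return wait_for_file(cfg_path, timeout)
```

With the path and wait time from the logs, `get_data("/var/run/vmware-imc/cust.cfg", 90)` would return True on the first boot and False on the second.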

Ryan Harper (raharper)
Changed in cloud-init:
status: New → Incomplete
Revision history for this message
Ilia (ilia-zenkevich) wrote :

Because cloud-init deletes it in DataSourceOVF.py

Revision history for this message
Ryan Harper (raharper) wrote :

You're quite right. I don't understand how this is expected to work.

Removing the customization file is going to prevent cloud-init from knowing it's on the same instance if the instance-id came from it.

but I don't think it did, I see this:

 def read_vmware_imc(config):
@@ -296,6 +330,9 @@ def read_vmware_imc(config):
     if config.timezone:
         cfg['timezone'] = config.timezone

+ # Generate a unique instance-id so that re-customization will
+ # happen in cloud-init
+ md['instance-id'] = "iid-vmware-" + util.rand_str(strlen=8)

Which is completely broken w.r.t. keeping the same datasource. The comment suggests that when redeploying to the same instance, "new" customization didn't happen, so to work around that the datasource generates a new "random" instance-id to avoid cloud-init's check on instance-id.

The OVF datasource is going to need quite a bit of work to sort out which paths are working and which are not; having the platform provide a consistent set of metadata to the instance in a reliable form for the life-time of the instance is a requirement. This doesn't appear to be happening via the Imc path.
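
The effect of the quoted line can be illustrated with a short sketch. `random_iid` is a stand-in for the `"iid-vmware-" + util.rand_str(strlen=8)` expression (the alphabet used here is an assumption), and `needs_recustomization` is a hypothetical name for the instance-id comparison discussed earlier in the thread.

```python
import random
import string


def random_iid(strlen: int = 8) -> str:
    """Mimic the quoted line: a fresh random suffix on every call."""
    alphabet = string.ascii_letters + string.digits  # assumed alphabet
    suffix = "".join(random.choice(alphabet) for _ in range(strlen))
    return "iid-vmware-" + suffix


def needs_recustomization(stored_iid: str, reported_iid: str) -> bool:
    """A differing instance-id is treated as 'this is a new instance'."""
    return stored_iid != reported_iid
```

Because every call to `random_iid()` yields a different value with overwhelming probability, an id generated this way cannot be used to recognise the same instance on a later boot.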

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

Hi Ryan,

Thanks for the investigation.

There is no consistent user data provided to the OVF datasource; the "VMware Customization Config File" is available only when customization happens on the VM.
1. Do Guest OS Customization on the VM (firstboot finds OVF Customization)
When customization happens on the VM, Open-VM-Tools finds cust.cfg and copies it from the ESX server to the VM path /var/run/vmware-imc/. The OVF datasource then finds /var/run/vmware-imc/cust.cfg and loads it.
> 2019-07-24 07:50:48,823 - DataSourceOVF.py[DEBUG]: Found VMware Customization Config File at /var/run/vmware-imc/cust.cfg

2. Reboot VM (Second boot of same instance does not find metadata from DataSourceOVF)
This time no customization happens on the VM, so Open-VM-Tools finds no cust.cfg, the OVF datasource cannot find /var/run/vmware-imc/cust.cfg, and cloud-init exits without a datasource.
> 2019-07-24 07:58:19,856 - util.py[DEBUG]: waiting for configuration file took 90.110 seconds
> 2019-07-24 07:58:19,858 - DataSourceOVF.py[DEBUG]: Did not find VMware Customization Config File

As far as I know, there are 2 paths.
1. Customize the VM with cloud-init and then re-customize the VM with cloud-init.
This path works now, since the "random" instance-id makes cloud-init search for userdata from the OVF datasource on every boot.

2. Customize the VM with cloud-init and then just reboot.
This path is broken. The expected behavior is that on the second boot, cloud-init keeps the network configuration unchanged.
I have not thought of a good solution; please shed some light on how to achieve it.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

Links to VMware bug# 2377600

Pengpeng Sun (pengpengs)
Changed in cloud-init:
status: Incomplete → New
Revision history for this message
Ilia (ilia-zenkevich) wrote :

What does "Links to VMware bug# 2377600" mean? I can't find a bug with that id. Can you please give a direct link?

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@ilia, It should be "Links to VMware INTERNAL bug# 2377600". It's just a tracker for me.

Paride Legovini (paride)
Changed in cloud-init:
status: New → Triaged
Revision history for this message
Eduardo Otubo (otubo) wrote :
Revision history for this message
Eduardo Otubo (otubo) wrote :

@Pengpeng Sun do you have a status on that? Would another pair of hands be useful? I ran into this bug while checking the above linked BZ. I have access to vSphere and could help take a look.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@otubo, From the Red Hat bug, I think it's not quite the same as this one. This one occurs after enabling VMware guest customization with cloud-init. With the latest cloud-init code, cloud-init will not disable itself, because VMware guest customization is enabled.

For Red Hat Linux, I filed a bug, https://bugzilla.redhat.com/show_bug.cgi?id=1722330, about ds-identify being missing. ds-identify runs during boot to check whether any datasource can be found; if no datasource is available, it disables cloud-init. I suggest Red Hat Linux add ds-identify, which should then fix the bug https://bugzilla.redhat.com/show_bug.cgi?id=1593010

Revision history for this message
Eduardo Otubo (otubo) wrote :

@pengpengs, the first problem is indeed including ds-identify in the rpm; that was the beginning of the solution. But after I included it, I found that from the second boot onward cloud-init doesn't disable itself even if the VMware datasource is not found. I believe that is part of this issue: instead, it treats itself as a new instance (iid reset) and resets the network configuration to the default (DHCP).

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@otubo, for this issue, cloud-init will always be enabled, since ds-identify finds Open-VM-Tools installed and "disable_vmware_customization: false" in the /etc/cloud/cloud.cfg file. With both of them, ds-identify returns the OVF datasource and enables cloud-init. See the code at https://github.com/cloud-init/cloud-init/blob/master/tools/ds-identify#L705

If you have included ds-identify in the rpm, could you please check the generated /var/run/cloud-init/ds-identify.log on the Linux system where cloud-init is enabled at the second boot? Did it find another datasource?

Revision history for this message
Eduardo Otubo (otubo) wrote :

So what you're saying is that *before* my second boot I should set 'disable_vmware_customization: true', otherwise cloud-init will always be enabled despite this bug?

Does it make sense to have the function ovf_vmware_guest_customization() check for: Open-VM-Tools installed, 'disable_vmware_customization: false', whether the datasource is available *and* whether the iid is still the same? Adding these last two checks would save me from changing the configuration from true to false and from false to true every time I need a customization. I would just need to attach the datasource and we would be good to go.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

If you have set 'disable_vmware_customization: false' in /etc/cloud/cloud.cfg, have Open-VM-Tools installed, and the machine is a Virtual Machine running on vSphere, you will hit this bug.
So there are 3 conditions; remove any one of them and cloud-init will NOT use DataSourceOVF to do customization.
ovf_vmware_guest_customization() checks these 3 conditions and returns "DataSourceOVF found" when all 3 are true.
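
The three conditions can be written as a single predicate. This is only a Python restatement of the comment above, not the ds-identify shell code; `GuestState` and `ovf_customization_enabled` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GuestState:
    """Hypothetical snapshot of the three inputs named in the comment."""
    on_vmware: bool                      # VM is running on vSphere
    open_vm_tools_installed: bool        # Open-VM-Tools is present
    disable_vmware_customization: bool   # value from /etc/cloud/cloud.cfg


def ovf_customization_enabled(state: GuestState) -> bool:
    """True only when all three conditions hold at the same time."""
    return (
        state.on_vmware
        and state.open_vm_tools_installed
        and not state.disable_vmware_customization
    )
```

Flipping any single field makes the predicate False, which corresponds to "remove any one of them and cloud-init will NOT use DataSourceOVF".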

I have a question on 'attach the datasource', are you attaching an ISO to CD-ROM?

Revision history for this message
Robbert Muller (mjrider) wrote :

Thinking a bit outside the box maybe

would getting the instance-id from the machine, just like the vmware-guestinfo datasource does, be a better idea?

https://github.com/vmware/cloud-init-vmware-guestinfo/blob/99442e7ab784561c116caae54fe3a0c4a947a207/DataSourceVMwareGuestInfo.py#L160-L167

fudging around with a random instance id to recustomize a machine sounds like the wrong way around.
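
A rough sketch of that idea, assuming the guest exposes a stable SMBIOS product UUID under `/sys/class/dmi/id/product_uuid` (reading it typically requires root). Whether this matches the linked datasource line for line is not confirmed here, and `stable_instance_id` is a hypothetical name.

```python
from pathlib import Path

# Assumed location of the SMBIOS product UUID on Linux guests.
PRODUCT_UUID_PATH = Path("/sys/class/dmi/id/product_uuid")


def stable_instance_id(uuid_path: Path = PRODUCT_UUID_PATH) -> str:
    """Derive an instance-id from a value that does not change on reboot."""
    return uuid_path.read_text().strip().lower()
```

Unlike a random suffix, this value is identical on every boot of the same VM, so the stored and reported ids would match after a plain reboot.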

Revision history for this message
Eduardo Otubo (otubo) wrote :

@pengpengs I'm using the data source from vCenter "Policies and profile"; there's no ISO or CDROM. Do you need a hand debugging this issue? I have access to a vSphere cluster to test and create VMs, etc.

Thanks.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@Eduardo,
Yes, please help to debug this. I will also work on this soon.
The user data comes from /var/run/vmware-imc/cust.cfg not ISO or CDROM.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

VMware internal PR #2377600 is tracking this issue.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@Ryan As you know, the author of PR #229 (https://github.com/canonical/cloud-init/pull/229) has not responded for months. To fix the issue, could you please help implement a similar change for LOCAL datasources only?

Revision history for this message
Ryan Harper (raharper) wrote :

@Pengpeng

Have you tried testing this bug with PR #229 applied?

I suspect that it might just work since DataSourceOVF does not persist an obj.pkl

Per comment:

https://github.com/canonical/cloud-init/pull/229/commits/82d039b0496902ceef4e73586e533295903e2e31#r397500586

The issue there was that if the datasource persisted an object; then init-net stage would
use the existing object on disk and not use the fallback path.

For OVF, at local time, it will run _get_data over the datasource list, which after reboot per this bug does not return True; so I believe the fallback path should then trigger as expected.

For Ec2 or other datasources which do persist objects it's not clear what a fallback scenario looks like.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@Ryan

Yes, I have verified the fallback path worked after PR #229 applied. so it should fix this issue.

I do NOT get "...DataSourceOVF does not persist an obj.pkl". From what I saw, after loaded DataSourceOVF, obj.pkl file was created under instance path. Doesn't the fallback ds load from obj.pkl?

Revision history for this message
Ryan Harper (raharper) wrote :

@Pengpeng,

> Yes, I have verified the fallback path worked after PR #229 applied. so it
> should fix this issue.

OK. That's good news. I think for local datasources, the fallback option
works quite well.

>
> I do NOT get "...DataSourceOVF does not persist an obj.pkl". From what I
> saw, after loaded DataSourceOVF, obj.pkl file was created under instance
> path. Doesn't the fallback ds load from obj.pkl?

Sorry. I should have been more clear. While the obj.pkl is written for
local datasources, it's only used in transitioning between local and net
stages:

For all datasources, cloud-init boot sequence does the following:

1) cloud-init init --local will purge the object cache unless it's told
   not to (manual cache clean); this removes the finished boot flag
   and sets the 'existing' variable to 'check'; this is important for
   the next stage of cloud-init (cloud-init init)

   cloudinit/cmd/main.py:main_init lines 305 -> 317

   Now, cloud-init init --local will fetch the datasource with
   existing='check'

   cloudinit/stages.py:fetch() lines 349
     cloudinit/stages.py:_get_data_sources(existing='check') lines 236
       cloudinit/stages.py:_restore_from_checked_cache(existing='check') lines 211
         cloudinit/stages.py:_restore_from_cache() lines 184
           cloudinit/stages.py:_pkl_load() lines 943

    On reboot scenarios, the datasource is reloaded and then cloud-init
    compares the instance-id from /var/lib/cloud/data/instance-id to the
    ds.get_instance_id() value; if they match, then it's restored from cache.

    If they do not match, then the datasource from cache (obj.pkl) is
    ignored. Then cloud-init will walk the local datasources calling
    ._get_data() on each until a local datasource is found.

    Once a datasource is found (one of the datasources' _get_data() methods
    returns True), then self.datasource is set and we do persist an obj.pkl

    cloud-init init --local exits.

2) cloud-init runs after networking is up, this time in main.py:main_init
   ds mode is in NET mode which sets existing='trust'; the idea here
   is that cloud-init local mode detects and finds the correct datasource
   and so at NET stage we don't need to look for a datasource again if one
   was found.

   For *network only* datasources (Ec2 for example); ones that can only be
   detected by checking a network endpoint (for example
   http://169.254.169.254) will not be checked until networking is up; so in
   (1) we don't bother checking until this stage.

   When cloud-init attempts to restore_from_checked_cache() with
   existing='trust', then the pkl_load() succeeds.

   Next, cloud-init looks up the on-disk instance-id:
      /var/lib/cloud/data/instance-id
   And then calls ds.get_instance_id() and if they match, then we're on
   the same instance as we were before. For local datasources; this is
   trivially true as when we found the datasource in (1) we persisted the
   object and we've just loaded it.

For local-only datasources (NoCloud, OVF); I think 229 works; Specifically
for OVF since its detection is based on files that are removed after first
boot. I'd like to add a unittest to exercise this path, spe...
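
The two-stage behaviour described above can be condensed into a simplified sketch. It only models the 'check' versus 'trust' decision and the walk over candidate datasources; `FakeDS`, `restore_from_cache` and `find_datasource` are hypothetical names, and the real logic in cloudinit/stages.py has more cases.

```python
from dataclasses import dataclass


@dataclass
class FakeDS:
    """Hypothetical stand-in for a datasource object."""
    name: str
    iid: str
    has_data: bool = True

    def get_instance_id(self) -> str:
        return self.iid

    def get_data(self) -> bool:
        return self.has_data


def restore_from_cache(cached_ds, stored_iid, existing):
    """Return the cached datasource if it may be reused, else None.

    existing='trust' (network stage) reuses the cache as-is;
    existing='check' (local stage) reuses it only on an instance-id match.
    """
    if cached_ds is None:
        return None
    if existing == "trust":
        return cached_ds
    if existing == "check" and cached_ds.get_instance_id() == stored_iid:
        return cached_ds
    return None


def find_datasource(candidates, cached_ds, stored_iid, existing):
    """Reuse the cache when allowed; otherwise probe each candidate in order."""
    ds = restore_from_cache(cached_ds, stored_iid, existing)
    if ds is not None:
        return ds
    for candidate in candidates:
        if candidate.get_data():
            return candidate
    return None
```

In this model, a local-stage run with a mismatched id and an OVF candidate whose `get_data()` returns False ends with no datasource, which is the reboot case in this bug.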

Revision history for this message
MikeN (mike-normi) wrote :

Just wanted to chime in here as a VMware user who is suffering from the current implementation, to give some food for thought.

I do not really see what any of the proposed changes would fix. The root of all the issues is:

- VMware customization using a file (vmware-imc) only supports network config, but no user-data or anything else, and is thus quite useless for cloud-init.
- VMware customization using the OVF file supports user-data and is persistent, but does not support any network config, and is thus quite useless for setting up a VM as well.

What I do not understand is why all this effort is not put into providing network config through the OVF file (which can be easily done by just adding a 'meta-data' base64 string next to the 'user-data' base64 string). A bug is open for this (#1247055) but not much seems to be happening there.

As the vmware-imc customization will never offer any user-data _and_ has the problem that it disappears, using it in any sensible way requires fixing the data source caching, but also fixing a merge between OVF data and IMC data (which is bug #1806133). If any of these issues is not fixed, it will only be possible to provide network config through IMC, but supplying any OVF user-data will always trigger a re-init on the next reboot, breaking the system.

From the VMware point of view, I do not understand why they don't put their efforts into getting cloud-init-vmware-guestinfo merged into the main cloud-init tree, and subsequently adding support for this in vCloud Director (which is unable to set these guestinfo properties at the moment).

I would be happy to look into supporting meta-data through the OVF file, and it seems quite trivial, but let me know if I'm missing something here and if it doesn't make sense.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@Ryan,

Thanks a lot! You explained the stages clearly.
While I think you already know this: When I apply #229 in my test, I set fallback to True only when mode == sources.DSMODE_LOCAL.

- init.fetch(existing=existing)
+ init.fetch(existing=existing,
+ fallback=(mode == sources.DSMODE_LOCAL))
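
The thread does not show what `fallback` does inside `fetch`. The sketch below assumes it means "if no datasource reports data, reuse the datasource from the previous boot instead of ending up with none", which is an interpretation and not confirmed by PR #229. All names are hypothetical.

```python
def fetch(candidates, previous_ds, fallback):
    """Probe candidates in order; optionally fall back to the previous datasource.

    candidates:  objects exposing get_data() -> bool
    previous_ds: datasource remembered from an earlier boot, or None
    fallback:    assumed meaning: reuse previous_ds when nothing is found
    """
    for ds in candidates:
        if ds.get_data():
            return ds
    if fallback and previous_ds is not None:
        return previous_ds
    return None
```

Under that assumption, enabling the fallback only for the local stage would let a rebooted VM keep its earlier OVF datasource when cust.cfg is no longer present.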

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@MikeN, I fully understand what you are suffering from. You are right that OVF can apply only meta-data from the vmware-imc configuration or only user-data from the OVF file. One solution could be for vmware-imc customization to offer both meta-data and user-data to cloud-init.

Revision history for this message
Steffen Becker (snowball77) wrote :

I am facing the same issue. I have installed the latest focal cloud image using the user-data field to customize the VM and also set disable_vmware_customization: false for the first boot. I converted it into a template for cloning. It customizes correctly on first boot of a clone from the template. On reboot it falls back.

In this setup it does not even matter that you remove disable_vmware_customization: again. Also with the removed config it does not persist the custom static IP on the reboot. Only creating /etc/cloud/cloud-init.disabled works and helps to persist the custom IP settings. However, then I cannot customize the VM a second time.

On the OVF Details page, transferring the OVF via ISO is enabled in this setup, coming from the imported ova cloud image. If you want me to provide any logs I am happy to do so.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

Hi @Steffen,

Yes, this issue has not been fixed in the current latest cloud-init version, 20.3. Please refer to the VMware KB https://kb.vmware.com/s/article/71264 to work around this issue.

@Ryan, do you think this issue can be fixed in the next cloud-init release?

Revision history for this message
Ben (bwatson1979) wrote :

Our process is to download the raw cloud-init image and upload it to a datastore in vSphere/vCenter as a template. We then clone the template to a VM, create an ISO CD-ROM image with meta-data and user-data at the root of the ISO, attach the ISO to the CD-ROM of the VM, then power on the VM.

This process has served us well with Ubuntu 16.04 through 20.04 and for EL 7 (RHEL/CentOS). It isn't until EL 8 that we've run into issues. Various Red Hat articles, issues, and bugzilla entries have landed me here. I've recently tried RHEL 8.2 and still cannot get it to cloud-init with our process that has worked so well.

If you look at the running VM in the VMware console, the screen immediately says "Probing EDD (edd off to disable)... ok" and sits like that indefinitely. I'm not sure if my issues are related to this or not, but it is difficult to troubleshoot since I cannot even get a command prompt to get into the VM and see what's going on.

Revision history for this message
Eduardo Otubo (otubo) wrote :

Anyone have an update on this issue?
We have some RHEL instances on Power that are being affected.

Thanks!

Revision history for this message
James Falcon (falcojr) wrote :
Changed in cloud-init:
status: Triaged → Expired
Revision history for this message
Andreas Lindhé (lindhe) wrote :

Does the GitHub Issue replace this launchpad site for discussions, or why have two?
