cloud-init may hang OS boot process due to grep for the entire ISO file when it is attached

Bug #1806701 reported by Pengzhen(Peter) Cao
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
cloud-init
Fix Released
Medium
Pengpeng Sun

Bug Description

We have found in our test for SLES15 with cloud-init installed, if we attach a ISO file with the VM before VM is boot, it often takes more than 10 minutes to start the SLES OS. Sometimes it failed to start the SLES OS at all.

We've root caused it is due to the "is_cdrom_ovf()" func of "tools/ds-identify".

In this function, there is the following logic to detect if an ISO contains certain string:
> local idstr="http://schemas.dmtf.org/ovf/environment/1"
> grep --quiet --ignore-case "$idstr" "${PATH_ROOT}$dev"
ref: https://git.launchpad.net/cloud-init/tree/tools/ds-identify

It is trying to grep the who ISO file for a certain string, which causes intense IO pressure for the system.
What is worse is that sometimes the ISO file is large (e.g. >5GB for installer DVD) and it is mounted over NFS. The "grep" process often consume 99% CPU and seems hang. Then the systemd starts more and more "grep" process which smoke the CPU and consumes all the IO bandwidth for the ISO file. Then the system may hang for a long time and sometimes failed to start.

To fix this issue, I suggest that we should not grep for the entire ISO file. Rather then we should just check if the file/dir exists with os.path.exists().

-------------------------debug log snip------------------------
pek2-gosv-16-dhcp180:~ # ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 13:32 ? 00:00:04 /usr/lib/systemd/systemd --switched-root --system --deserialize 24

root 474 1 0 13:34 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 482 474 2 13:34 ? 00:00:15 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root 1020 1 0 13:35 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1039 1020 1 13:35 ? 00:00:07 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
polkitd 1049 1 0 13:37 ? 00:00:00 /usr/lib/polkit-1/polkitd --no-debug
root 1051 1 0 13:37 ? 00:00:00 /usr/sbin/wickedd --systemd --foreground
root 1052 1 0 13:37 ? 00:00:00 /usr/lib/systemd/systemd-logind
root 1054 1 0 13:37 ? 00:00:00 /usr/sbin/wickedd-nanny --systemd --foreground
root 1073 1 0 13:37 ? 00:00:00 /usr/bin/vmtoolsd
root 1097 1 0 13:37 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1110 1097 1 13:37 ? 00:00:04 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root 1304 1 0 13:38 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1312 1304 1 13:38 ? 00:00:03 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1
root 1537 1 0 13:40 ? 00:00:00 /usr/bin/plymouth --wait
root 1613 1 0 13:40 ? 00:00:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 1645 1613 0 13:40 ? 00:00:02 grep --quiet --ignore-case http://schemas.dmtf.org/ovf/environment/1 /dev/sr1

Grep use nearly 100% cpu, system very slow.

top - 13:46:37 up 26 min, 2 users, load average: 14.14, 15.03, 10.57
Tasks: 225 total, 6 running, 219 sleeping, 0 stopped, 0 zombie
%Cpu(s): 40.1 us, 49.3 sy, 0.0 ni, 0.0 id, 1.4 wa, 0.0 hi, 9.1 si, 0.0 st
KiB Mem : 1000916 total, 64600 free, 355880 used, 580436 buff/cache
KiB Swap: 1288168 total, 1285600 free, 2568 used. 492688 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4427 root 20 0 40100 3940 3084 R 99.90 0.394 0:27.41 top
1016 root 20 0 197796 4852 3400 R 99.90 0.485 1:26.44 vmtoolsd
1723 root 20 0 7256 1860 1556 D 99.90 0.186 0:28.44 grep
  484 root 20 0 7256 1684 1396 D 99.90 0.168 1:51.22 grep
1278 root 20 0 7256 1856 1556 D 99.90 0.185 0:38.44 grep
1398 root 20 0 7256 1860 1556 R 99.90 0.186 0:28.53 grep
1061 root 20 0 7256 1856 1556 D 99.90 0.185 0:56.62 grep
-------------------------debug log snip------------------------

Related branches

Revision history for this message
Scott Moser (smoser) wrote :

@Peter,

You're correct that ds-identify both that ds-identify does grep through
and entire cdrom if attached with an ISO9660 filesystem. The reason for this hack
is that
a.) the OVF specification does not indicate the filesystem label for ISO transport.
If it did, we could require that. As it is, there is no easy way to positively
identify the OVF transport cdrom.
b.) when this code was developed I asked a vmware contact if I could rely on
any such label in practice and was told no.
c.) we did not want to attempt a mount in ds-identify and cannot rely on the
filesystem being mounted at that point in time.

Note that commit 530850f971e improves the situation on vmware to
*not* do the grep if the filesystem has a label 'OVF ENV'. In some recent
experience I've seen this label provided by vmware.

If we can rely on that label on VMWare OVF systems, then I think I would be
fine with dropping the grep.

Note that the grep is protected from executing in many ways. It will only
execute if:
 a.) the ISO9660 filesystem is on a device named /dev/sr[0-9] or /dev/hd[a-z].
 b.) the label is not one of case insensitive: OVF-Transport OVFENV, OVF ENV, config-2, rd_rdfe_stable*, cidata.

If we cannot drop the grep all together because VMware/OVF spec does not provide
us a filesystem label to match on, then we could potentially further limit the
grep to only occur if the device attached was very small (perhaps < 10M ?).
That way we would not grep through DVD or 700M install ISOs.

Thoughts?

Revision history for this message
Pengzhen(Peter) Cao (pengzhencao) wrote :

In the case of OVF customization the protection could work. However, if a user choose to install an OS from the installation ISO image then it could not. In our example we were trying to install a SLES15 OS from its installer DVD attached by virtual CDROM(/dev/sr0).

It is the OS vender's choice for what the ISO's lable is, usually "XXX installer DVD". "XXX package DVD"...

If we can not just check if certain config file exists on the ISO image, then I think your suggestion for limiting the ISO file size before grep seems a solution. As the config iso file will defenitely be a small one.

Revision history for this message
Scott Moser (smoser) wrote :

@Peter,

I think I'm happy to drop the 'grep' if you tell me that vmware installations will always label the OVF transport 'OVF ENV' which is my current experience with a vSphere Client 6.0.0 and vCenter Server 6.0.0.

if we cannot rely on that, then other heuristic checks is all we can do.

Revision history for this message
Scott Moser (smoser) wrote :

well, not *all* we can do, but without adding either a 'mount' or a dependency on something like 'isoinfo'.

Revision history for this message
Pengzhen(Peter) Cao (pengzhencao) wrote :

Let me check with OVF team internally and update you if we use fixed name for the iso lable.

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

Hi Scott,

If only consider to speed up VMware guest customization process, could we move if ovf_vmware_guest_customization check to be in front of "is_cdrom_ovf" in dsCheckOVF()?
https://github.com/cloud-init/cloud-init/blob/12bc76cebf69a1c8cf9eba78431333842ed170cf/tools/ds-identify#L772

Not sure if there is side-effect to do this change.

Thanks,
Pengpeng

Revision history for this message
Scott Moser (smoser) wrote : Re: [Bug 1806701] Re: cloud-init may hang OS boot process due to grep for the entire ISO file when it is attached

it does seem like that would improve things, pengpeng. Much better
would be for vmware to promise (at least with new versions) that the
attached OVF cdrom would have a specific label.

On Fri, Feb 22, 2019 at 3:30 AM Pengpeng Sun <email address hidden> wrote:
>
> Hi Scott,
>
> If only consider to speed up VMware guest customization process, could we move if ovf_vmware_guest_customization check to be in front of "is_cdrom_ovf" in dsCheckOVF()?
> https://github.com/cloud-init/cloud-init/blob/12bc76cebf69a1c8cf9eba78431333842ed170cf/tools/ds-identify#L772
>
> Not sure if there is side-effect to do this change.
>
>
> Thanks,
> Pengpeng
>
> --
> You received this bug notification because you are subscribed to cloud-
> init.
> https://bugs.launchpad.net/bugs/1806701
>
> Title:
> cloud-init may hang OS boot process due to grep for the entire ISO
> file when it is attached
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/cloud-init/+bug/1806701/+subscriptions

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@Scott, agree with you. While I checked the current versions, the Label contains no fixed string. If you attached a VMTools ISO, then LABEL=VMware Tools, TYPE=iso9660.
If you attached a RHEL7.6 installer ISO, then LABEL=RHEL-7.6 Server.x86_64, TYPE=iso9660.
so LABEL depends on what's ISO attached.
We will try to get an official answer from OVF team internally, but the answer could be 'no specific label'.

Thanks,
Pengpeng

Pengpeng Sun (pengpengs)
Changed in cloud-init:
assignee: nobody → Pengpeng Sun (pengpengs)
Revision history for this message
Pengpeng Sun (pengpengs) wrote :

>Note that commit 530850f971e improves the situation on vmware to
>*not* do the grep if the filesystem has a label 'OVF ENV'. In some recent
>experience I've seen this label provided by vmware.

As Scott mentioned, got the label 'OVF ENV' when attached OVF cdrom which need enable vApp options on a VM.
While the vApp options are not enabled by default on a VMware VM, so can NOT reply on it to check if current is a VMware installation.
To solve this issue, checking size might be the best option, that grep entire cdrom size only if it's size is small (ex: <=10MB).
And to speed up VMware guest customization, could move if ovf_vmware_guest_customization check to be in front of "is_cdrom_ovf" in dsCheckOVF()? , see https://bugs.launchpad.net/cloud-init/+bug/1806701/comments/6

Copy Ryan and Dan on this.

*** An article on how to enable vApp options on a VM: https://www.virtuallyghetto.com/2012/06/ovf-runtime-environment.html

Ryan Harper (raharper)
Changed in cloud-init:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Pengpeng Sun (pengpengs) wrote :
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

This bug is fixed with commit d9769c47 to cloud-init on branch master.
To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=d9769c47

Changed in cloud-init:
status: Triaged → Fix Committed
Revision history for this message
Ryan Harper (raharper) wrote : Fixed in cloud-init version 19.2.

This bug is believed to be fixed in cloud-init in version 19.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in cloud-init:
status: Fix Committed → Fix Released
Revision history for this message
Nicklas (nicklaswallgren) wrote :

We got recurring problems with cloud-init/ds-identify process. It spawns sub-processes "blkid -c /dev/null -o export" which gets stuck in the "D" uninterruptible sleep state.

The processes cannot be killed, so the only solution is to reboot the affected server.

root 3839 0.0 0.0 4760 1840 ? S Dec05 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 3844 0.0 0.0 11212 2836 ? D Dec05 0:00 \_ blkid -c /dev/null -o export
root 6943 0.0 0.0 4760 1880 ? S Dec05 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 6948 0.0 0.0 11212 2844 ? D Dec05 0:00 \_ blkid -c /dev/null -o export
root 6111 0.0 0.0 4760 1916 ? S Dec12 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 6149 0.0 0.0 11212 2940 ? D Dec12 0:00 \_ blkid -c /dev/null -o export
root 8765 0.0 0.3 926528 24968 ? Ssl Dec12 0:12 /usr/lib/snapd/snapd
root 9179 0.0 0.0 4760 1892 ? S Dec12 0:00 /bin/sh /usr/lib/cloud-init/ds-identify
root 9185 0.0 0.0 11980 3552 ? D Dec12 0:00 \_ blkid -c /dev/null -o export

Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic

5.0.0-36-generic #39~18.04.1-Ubuntu SMP Tue Nov 12 11:09:50 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Revision history for this message
Pengpeng Sun (pengpengs) wrote :

@Nicklas, I think the /var/run/cloud-init/ds-identify.log is helpful to investigate the problem.
And could you please file a new bug with above log file, it's a different issue with the one in this bug.

Revision history for this message
James Falcon (falcojr) wrote :
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.