Compose machine failure: Start tag expected, '<' not found, line 1, column 1

Bug #1690781 reported by Данило Шеган (Danilo Šegan)
This bug affects 2 people
Affects: MAAS
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

Attempting to compose a machine on a virsh pod failed for me with an error like the one below:

  $ maas maas1 pod compose 2
  Unable to compose machine because: Failed talking to pod: Start tag expected, '<' not found, line 1, column 1

The full traceback from regiond.log is at http://paste.ubuntu.com/24549398/, and the one from rackd.log is at http://paste.ubuntu.com/24549396/

See the bottom of the bug for full reproduction steps. The error goes away after a reboot if your virsh pod is properly set up, so I don't think this is a high priority. The workaround (other than a reboot) is to restart the libvirt-bin service.

Looking at what get_domain_capabilites() (note the typo in the name; watch out when you grep for it) does, it basically just calls out to

  virsh domcapabilities --virttype kvm

This was on a nested VM instance, where the call failed with:

  ubuntu@maas1:~$ virsh domcapabilities --virttype kvm
  error: failed to get emulator capabilities
  error: invalid argument: unable to find any emulator to serve 'x86_64' architecture

so blindly attempting to parse that as XML failed.
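
For illustration, here is a minimal sketch of why the parse fails with exactly this message. It assumes the captured output is handed to lxml's etree.XML() (the parser mentioned further down); the exact captured text is a guess:

  from lxml import etree

  # Hypothetical stand-in for what the failing virsh call produced;
  # the exact captured text is an assumption.
  output = (
      "error: failed to get emulator capabilities\n"
      "error: invalid argument: unable to find any emulator to serve "
      "'x86_64' architecture\n"
  )

  try:
      etree.XML(output)
  except etree.XMLSyntaxError as exc:
      print(exc)  # Start tag expected, '<' not found, line 1, column 1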

Calling out to "virsh domcapabilities --virttype kvm /usr/bin/qemu-system-x86_64" worked for me, and afterwards, even just "virsh domcapabilities --virttype kvm" started to work.

It turns out the problem was that after installing the qemu-kvm package, one needs to restart the libvirt-bin service so that it realizes the x86_64 emulator is present:

  sudo systemctl restart libvirt-bin.service

After that, everything worked fine too.

We should:

 1. Fix get_domain_capabilites() to check the exit code of the virsh call (it is 1 for me, so it does indicate failure) and not attempt to parse the output as XML in case of an error (see the sketch below)

 2. Surface the error from the virsh call instead of the XML parsing error

 3. Perhaps suggest a workaround to the user (install the qemu-kvm package and restart the libvirt-bin service), even though they might not be an administrator of the pod

 4. Maybe even file a bug against e.g. the qemu-kvm package (to restart libvirt-bin itself) or libvirtd (to watch for newly appearing emulators with inotify or whatever the latest FS monitoring solution is)

Alternatively, we could just use the detected architecture ourselves and try to find the emulator, to "teach" the libvirt-bin service about its presence if the package is already installed, though I am not sure whether this would work well with a remote virsh connection.
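
A rough sketch of what points 1 and 2 could look like, using subprocess purely for illustration (the real driver drives virsh through a persistent session, and VirshError here is a made-up exception):

  import subprocess

  from lxml import etree

  class VirshError(Exception):
      """Hypothetical exception carrying virsh's own error text."""

  def get_domain_capabilities(virttype="kvm"):
      proc = subprocess.run(
          ["virsh", "domcapabilities", "--virttype", virttype],
          capture_output=True, text=True,
      )
      # Point 1: check the exit code (1 in the failure above) instead
      # of blindly parsing whatever came back.
      if proc.returncode != 0:
          # Point 2: surface virsh's error, not an XML parsing error.
          raise VirshError(proc.stderr.strip() or "virsh call failed")
      return etree.XML(proc.stdout)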

To reproduce (probably also possible with just the qemu-system-x86 package instead of qemu-kvm):

 sudo apt remove -y qemu-kvm; sudo apt autoremove -y; sudo apt install -y qemu-kvm; sudo systemctl restart libvirt-bin.service
 virsh domcapabilities --virttype kvm
 maas maas-connection pod compose POD-ID (get it using "maas maas-connection pods read")

All the other places in the virsh pod driver seem to at least check for the XML output being None (though in this case I get "\n" on stdout, not an empty string) before attempting to parse it as XML with etree.XML().
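
That guard, in hypothetical form, also shows why it would not have helped here:

  from lxml import etree

  xml = "\n"  # what the failing call actually left on stdout
  if xml is None:
      capabilities = None
  else:
      # Still raises etree.XMLSyntaxError: the output is not None,
      # it is just not XML.
      capabilities = etree.XML(xml)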

Changed in maas:
importance: Undecided → Medium
milestone: none → 2.2.0rc5
status: New → Triaged
Changed in maas:
assignee: nobody → Newell Jensen (newell-jensen)
Changed in maas:
status: Triaged → In Progress
Changed in maas:
milestone: 2.2.0rc5 → 2.2.1
Changed in maas:
status: In Progress → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
Changed in maas:
status: Fix Released → Fix Committed
Revision history for this message
Marian Gasparovic (marosg) wrote :

We encountered this bug (or at least something that looks like it) three times while testing 2.8.5-deb.

After we redeployed the machine, it worked again.

Here is one of the runs:

https://oil-jenkins.canonical.com/artifacts/a1657646-db57-49fd-9d8d-df28c777f653/index.html

Changed in maas:
status: Fix Committed → New
tags: added: cdo-qa cdo-release-blocker
Revision history for this message
Marian Gasparovic (marosg) wrote :

Observation: once we hit this issue, all subsequent runs hit it too (we don't restart or redeploy infra nodes between runs).
After rebooting all infra nodes and running the test again, it passed this critical point.

Alberto Donato (ack)
Changed in maas:
assignee: Newell Jensen (newell-jensen) → nobody
Alberto Donato (ack)
Changed in maas:
milestone: 2.2.1 → none
Alberto Donato (ack)
Changed in maas:
status: New → Triaged
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Is this issue reproducible on MAAS 3.2 or later?

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired