Instances from large, fresh EMIs fail to start just after being registered

Bug #439410 reported by Thierry Carrez
This bug affects 1 person
Affects              Status        Importance  Assigned to     Milestone
Eucalyptus           Fix Released  High        Unassigned
eucalyptus (Ubuntu)  Fix Released  Low         Thierry Carrez

Bug Description

Running 1.6~bzr854-0ubuntu12

Starting an instance of a freshly baked EMI, I get on the NC:

nc.log
[EUCADEBUG ] walrus_request(): wrote 36 bytes in 1 writes
[EUCAERROR ] walrus_request(): server responded with HTTP code 500
[EUCAINFO ] walrus_request(): due to error, removing /var/lib/eucalyptus/instances/admin/i-35F90747/disk
[EUCAERROR ] error: failed to download file from Walrus into /var/lib/eucalyptus/instances/admin/i-35F90747/disk
[EUCAFATAL ] Failed to prepare images for instance i-35F90747 (error=1)

Running a new instance of the same EMI just after that usually succeeds.
Couldn't find anything useful in the CC logs.

Thierry Carrez (ttx)
Changed in eucalyptus (Ubuntu):
importance: Undecided → Medium
Thierry Carrez (ttx) wrote :

Confirmed on an i386 install.
The first instance run fails. The kernel and ramdisk are downloaded correctly, but the image download fails after a couple of minutes.
Attached nc.log, which shows the Walrus error 500.
Nothing stands out in the cluster-side logs, though I kept them in case they are needed.

Running a new instance of the same EMI just after that failure also failed. Subsequent runs worked.
When it works, here is the nc.log you get:
[Thu Oct 1 10:39:24 2009]
[EUCAINFO ] walrus_request(): downloading /var/lib/eucalyptus/instances/admin/i-2DD206C8/disk
[EUCAINFO ] from http://192.168.0.127:8773/services/Walrus/uecimage/ubuntu-uec-karmic-i386.img.manifest.xml
[EUCADEBUG ] walrus_request(): writing GET/GetDecryptedImage output to /var/lib/eucalyptus/instances/admin/i-2DD206C8/disk
then, after 3 minutes:
[EUCADEBUG ] walrus_request(): wrote -2147483648 bytes in 2 writes
[EUCAINFO ] walrus_request(): saved image in /var/lib/eucalyptus/instances/admin/i-2DD206C8/disk

Changed in eucalyptus (Ubuntu):
status: New → Confirmed
Thierry Carrez (ttx) wrote :

(that last try was with eucalyptus 1.6~bzr854-0ubuntu13)

tags: added: eucalyptus
Changed in eucalyptus (Ubuntu):
importance: Medium → High
Thierry Carrez (ttx) wrote :

This is a timeout that occurs on large images (10G) that were just registered.
The workaround is to use smaller images, or to wait longer between registration and first use.
Working with upstream on how best to solve this case.

Changed in eucalyptus (Ubuntu):
importance: High → Medium
status: Confirmed → Triaged
Dustin Kirkland (kirkland) wrote :

In the absolute worst case, this can be documented, or SRU'd.

:-Dustin

Changed in eucalyptus (Ubuntu):
importance: Medium → Low
Thierry Carrez (ttx)
summary: - Instances from fresh EMIs sometimes fail to start
+ Instances from large, fresh EMIs fail to start just after being
+ registered
Daniel Nurmi (nurmi)
Changed in eucalyptus:
importance: Undecided → High
Daniel Nurmi (nurmi) wrote :

The problem is that the connection from the NC to Walrus is opened successfully, but the server side times it out after two minutes if no data is written to the wire, which is the case while a large image is being decrypted/cached. We've increased the retry count in the NC to 10 retries (each lasting at most 2 minutes, so 20 minutes in total), which should allow enough time for images (~10GB) to be processed even on slow machines.
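
For illustration only, the bounded-retry idea described above can be sketched in Python. This is not the NC's actual code, which is not shown in this report. The names `download_with_retries`, `worst_case_wait_s`, `DownloadError`, and both constants are hypothetical; only the values 10 (retries) and 2 minutes (server-side timeout) come from the comment.

```python
import time
from typing import Callable

# Values taken from the comment above; the names are hypothetical.
MAX_ATTEMPTS = 10        # retry count after the fix
SERVER_TIMEOUT_S = 120   # server drops an idle connection after two minutes


class DownloadError(Exception):
    """Raised when a single download attempt fails (e.g. HTTP 500)."""


def download_with_retries(
    fetch: Callable[[], None],
    max_attempts: int = MAX_ATTEMPTS,
    pause_s: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call fetch() until it succeeds; return the number of attempts used.

    fetch is a stand-in for one image download attempt. It is expected to
    raise DownloadError when the server times out or returns an error.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            fetch()
            return attempt
        except DownloadError as exc:
            last_error = exc
            # Pause between attempts, but not after the final one.
            if attempt < max_attempts:
                sleep(pause_s)
    raise DownloadError(f"gave up after {max_attempts} attempts") from last_error


def worst_case_wait_s(
    max_attempts: int = MAX_ATTEMPTS, timeout_s: int = SERVER_TIMEOUT_S
) -> int:
    """Upper bound on time spent waiting if every attempt times out."""
    return max_attempts * timeout_s
```

With the defaults, `worst_case_wait_s()` returns 1200 seconds, which matches the 20 minute maximum quoted in the comment. An image that is still being decrypted/cached during the first attempts simply consumes some of those attempts before a later one succeeds.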

Fixed in Eucalyptus revision 920.

Changed in eucalyptus:
status: New → Fix Committed
Thierry Carrez (ttx)
Changed in eucalyptus (Ubuntu):
assignee: nobody → Thierry Carrez (ttx)
status: Triaged → In Progress
Thierry Carrez (ttx) wrote :

Verified fixed on a 10G image.

Changed in eucalyptus (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eucalyptus - 1.6~bzr919-0ubuntu3

---------------
eucalyptus (1.6~bzr919-0ubuntu3) karmic; urgency=low

  [ Matt Zimmerman ]
  * Kill the Eucalyptus DHCP server in eucalyptus-cc.upstart:stop
    (LP: #446056)

  [ Thierry Carrez ]
  * Add missing gnumail-providers.jar and inetlib.jar links to
    /usr/share/eucalyptus to enable email sending (LP: #449530)
  * Cherrypick upstream rev920, fixing Walrus timeouts (LP: #439410)

 -- Thierry Carrez <email address hidden> Mon, 12 Oct 2009 17:37:39 +0200

Changed in eucalyptus (Ubuntu):
status: Fix Committed → Fix Released
tags: added: iso-testing
Changed in eucalyptus:
status: Fix Committed → Fix Released