[SRU] eucalyptus create and delete volumes sometimes fail on lvm commands (POC + ENT configs)

Bug #590929 reported by Chris Cheney
This bug affects 2 people
Affects              Status         Importance  Assigned to      Milestone
Eucalyptus           Fix Released   Undecided   Neil Soman
  1.6.2              Fix Released   Undecided   Neil Soman
Jaguarundi           Fix Released   High        Dustin Kirkland
eucalyptus (Ubuntu)  Fix Released   Medium      Unassigned
  Lucid              Fix Released   Medium      Dustin Kirkland

Bug Description

While testing bug 586134 I found another issue; although it is similar to bug 517086, it does not appear to be the same issue.

If you create for example 200 1GB volumes like this:

for i in `seq 1 200` ; do euca-create-volume -z cluster1 -s 1 ; done

You will end up with quite a few failed volumes. If you look in the logs, the failure is usually caused by vgcreate failing for unknown reasons, as the output of the command is not logged. If you run the same command that failed on the command line, it will then work.
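
A quick way to spot the failures is to filter on the status column (a sketch; the field number assumes the VOLUME output format shown later in this report, and the log path is an assumption based on the cloud-error.log quoted further down):

  # list volumes whose status is "failed"
  euca-describe-volumes | awk '$1 == "VOLUME" && $5 == "failed"'
  # inspect the corresponding vgcreate errors on the storage controller
  sudo grep vgcreate /var/log/eucalyptus/cloud-error.log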

Also if you then delete all the volumes like this:

for i in `euca-describe-volumes | awk '{print $2}'` ; do euca-delete-volume $i ; done

You will end up with a huge number of volumes that do not get removed. If you look in the logs, the failure is usually caused by lvremove failing for unknown reasons, as with vgcreate. Again, if you run the same command that failed on the command line, it will then work.

Also, despite the failure to properly delete the volume, euca-describe-volumes will report invalid information. For example, if you delete the volume it will no longer show up in the list, but it will still be present as seen by losetup and vgscan, and it will still be set back up by eucalyptus after a reboot, yet it will not show up in the euca-describe-volumes list.
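
To see the mismatch directly, compare what the cloud reports with what the storage controller host still holds (a sketch; the volumes directory is the one mentioned later in this report):

  # volumes the cloud still knows about
  euca-describe-volumes | awk '$1 == "VOLUME" {print $2}'
  # loop devices still backed by eucalyptus volume files
  sudo losetup -a | grep /var/lib/eucalyptus/volumes
  # volume groups eucalyptus created but never removed
  sudo vgscan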

Chris

===============
SRU INFORMATION

IMPACT:
This bug affects users of UEC trying to create EBS volumes using the euca-create-volume command. In the current packages, this command will fail from time to time, for no apparent or reported reason, and Eucalyptus does not recover gracefully and retry the volume creation. This is a problem that heavy UEC users will probably encounter at some point.

HOW ADDRESSED:
This bug has been addressed by the upstream Eucalyptus maintainer of this code. The Eucalyptus 2.0 branch (in Maverick) does not suffer from this bug, as much of the storage-related code has been refactored in the interest of scalability. Neil Soman has graciously backported the minimal subset of those changes necessary to address this bug. The patch is still a few hundred lines, but it's all in the interest of solving this bug.

PATCH:
See bzr revisions r1235 and r1236 of lp:eucalyptus (upstream Eucalyptus branch).

REPRODUCE:
To reproduce this bug, simply:
  for i in $(seq 1 40); do euca-create-volume -z cluster1 -s 1; done
  sleep 10
  euca-describe-volumes | grep fail
  euca-describe-volumes | awk '{print $2}' | xargs -i euca-delete-volume {}
I see failures when trying to create as few as 40 volumes. (Note that I chose 40, as I have 50GB of disk free.) We would expect to be able to do the above ad nauseam, without unhandled errors.

REGRESSION POTENTIAL:
There is perhaps some chance of regression in the storage related functionality in Eucalyptus. For this reason, this particular SRU package will be heavily tested by Canonical OEM Server QA as well as Ubuntu Platform QA and Ubuntu Platform Server team members.

===============

Chris Cheney (ccheney)
description: updated
description: updated
Revision history for this message
C de-Avillez (hggdh2) wrote :

Setting confirmed/Medium. Chris and I recreated this scenario on different rigs.

Changed in eucalyptus (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Chris Cheney (ccheney) wrote :

This might actually be a problem with lvm, but even if it is caused by lvm, it should still be properly caught by eucalyptus, so that if a delete-volume failure happens it will show up in euca-describe-volumes.

Changed in eucalyptus (Ubuntu Lucid):
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
C de-Avillez (hggdh2) wrote :

for the record: eucalyptus 1.6.2-0ubuntu30.1 (myself) and 1.6.2-0ubuntu30.2 (Chris, with his fixes for 30.2)

Revision history for this message
Chris Cheney (ccheney) wrote :

This appears to be caused by eucalyptus not checking the return values it gets.

Revision history for this message
Chris Cheney (ccheney) wrote :

And to top it off lvm does not document its return codes, at least not in the man pages anyway.

Revision history for this message
Chris Cheney (ccheney) wrote :

It appears lvm always returns error code 5 on failure... for any kind of failure. So I'm not sure exactly what we should do here to fix this problem. Often the problem is actually just a can't-get-lock issue like the following, but without a useful error code we can't know whether to retry or to just fail.

20.35.01.198672267
/sbin/vgcreate vg-lDNyug.. /dev/loop6
  /var/lock/lvm/P_orphans: flock failed: Resource temporarily unavailable
  Can't get lock for orphan PVs
status = 5
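
Given that status 5 covers everything from lock contention to genuine failures, the only generic workaround at this level is a bounded retry (a minimal sketch; the function name and retry count are made up for illustration):

  retry_lvm() {
      local n
      for n in 1 2 3 4 5; do
          "$@" && return 0
          sleep 1   # give /var/lock/lvm/P_orphans a chance to be released
      done
      return 1
  }
  # e.g. the failing command from the log above:
  retry_lvm /sbin/vgcreate vg-lDNyug.. /dev/loop6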

Revision history for this message
Chris Cheney (ccheney) wrote :

It appears the liblvm library is in the newer lvm2 source, which is in Debian unstable. Whether converting eucalyptus to it is actually needed, and/or whether it would be enough to help fix these sorts of issues, is still unknown. From a cursory look it appears it may be useful.

Revision history for this message
Chris Cheney (ccheney) wrote :

It seems it's not so much the failure to check the return code as not doing the right thing afterwards. If it receives an error, it appears to just throw an exception and does not clean up after itself, at least in the createLogicalVolume case.

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

If I lower N from 200 to 50, I have been able to successfully do this a dozen times or so.

for i in `seq 1 50` ; do euca-create-volume -z cluster1 -s 1 ; done
for i in `euca-describe-volumes | awk '{print $2}'` ; do euca-delete-volume $i ; done

@eucalyptus developers: is there a known or expected limitation here?

mahmoh (mahmoh)
Changed in jaguarundi:
status: New → Confirmed
Revision history for this message
C de-Avillez (hggdh2) wrote :

Chris and I were able to repeat this by creating a lot of volumes in batch mode, pretty much as Dustin describes above. We got a mix of losetup errors (IIRC, a lock held) and various VG/LV errors; the result is an inconsistent state: m loop devices, n VGs, and l LVs, with l < n < m.
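
Counting the three layers on the SC host makes the l < n < m inconsistency easy to quantify (a sketch using the standard LVM2 reporting tools):

  sudo losetup -a | wc -l          # m: allocated loop devices
  sudo vgs --noheadings | wc -l    # n: volume groups
  sudo lvs --noheadings | wc -l    # l: logical volumes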

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

The system I'm working on here has a 127GB disk. It seems I'm able to create 50 x 1GB volumes, and then I get nothing but this error:

Volume: Error while communicating with Storage Controller: CreateStorageVolume:Your proposed upload exceeds the maximum allowed object size.

Revision history for this message
Neil Soman (neilsoman) wrote :

I have been unable to reproduce this issue against 2.0. Going to try against 1.6.2 and report back.

@Dustin,

"Volume: Error while communicating with Storage Controller: CreateStorageVolume:Your proposed upload exceeds the maximum allowed object size."

Web UI->configuration tab. Please increase the total max for volumes (default is 50).

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Okay, I verified that cluster1.storage.maxtotalvolumesizeingb in the web front end is greater than the size of volumes I'm trying to create.

I tried to create 100 x 1GB volumes, and 97 succeeded. However, I did get 3 failures:

$ euca-describe-volumes | grep failed
VOLUME vol-5FBB065C 1 cluster1 failed 2010-08-19T01:26:50.968Z
VOLUME vol-5957062E 1 cluster1 failed 2010-08-19T01:26:49.212Z
VOLUME vol-5FC3065C 1 cluster1 failed 2010-08-19T01:26:51.71Z

Revision history for this message
Neil Soman (neilsoman) wrote :

Marking it invalid against the latest dev branch (2.0), but I will try to reproduce it against 1.6.2.

Changed in eucalyptus:
status: New → Invalid
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

I have also tried this with a sleep 1, in case we have a race condition:

for i in `seq 1 200` ; do euca-create-volume -z cluster1 -s 1; sleep 1 ; done

After creating "a bunch" (162) of volumes, I ended up getting:
Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.
Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.
Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.

Revision history for this message
Neil Soman (neilsoman) wrote :

Against 1.6.2, I tried a test that involved creating and deleting 50 volumes in a tight loop, and it passed successfully. With 100 volumes, I was able to trigger a failure to clean up volumes. In addition, once the max volume limit is reached and exceeded, volumes are occasionally not cleaned up properly. I have seen the "Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error." message on occasion when this happens.

None of these issues can be reproduced against 2.0. Version 2.0 has a number of fixes that deal with failure recovery and CLC/SC communication, which help it scale. At this point, backporting these to 1.6.2 does not make a lot of sense (they are non-trivial), since 2.0 will be released shortly, but upstream is willing to discuss a possible backport effort if this is critical and upgrading to 2.0 is not an option.

mahmoh (mahmoh)
Changed in jaguarundi:
importance: Undecided → High
milestone: none → sales-training
milestone: sales-training → jaguar-ga
Revision history for this message
mahmoh (mahmoh) wrote : Re: eucalyptus create and delete volumes sometimes fail on lvm commands (POC + ENT configs)

This problem occurs on the all-in-one configuration.

May also occur on ENT config.

summary: - eucalyptus create and delete volumes sometimes fail on lvm commands
+ eucalyptus create and delete volumes sometimes fail on lvm commands (POC
+ + ENT configs)
Revision history for this message
C de-Avillez (hggdh2) wrote :

There are some points that should be kept in mind:

* this error is (quite possibly) the result of a race; the tests we usually conduct are stress tests -- in this case, we try to allocate a few hundred volumes in one single sweep -- we are requesting volume creation faster than volumes can be created, so we are more prone to the race condition(s).
* files are allocated in *sparse* mode -- so a newly-created 1GB file will actually use just a few tens of kilobytes. This means it is much more difficult to run out of space (by at least two orders of magnitude).
* if the loop device is not unallocated, then there is a (perhaps higher) chance of running out of loop devices -- we have the system preset to allocate 512 such devices. But, still, there is an easy bypass: just create more loop devices (see the sketch below).
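
For reference, creating additional loop device nodes on a kernel of this era can be done with mknod (a sketch; the range is arbitrary, and 7 is the loop driver's block major number):

  # add /dev/loop512 .. /dev/loop1023 (the minor number is the device index)
  for i in $(seq 512 1023); do
      sudo mknod -m 660 /dev/loop$i b 7 $i
      sudo chown root:disk /dev/loop$i
  done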

Revision history for this message
Neil Soman (neilsoman) wrote :

Fix has been applied to the 1.6.2 branch (revno 1235). There are a couple of issues here.

The first one has to do with concurrent creation of 100+ volumes in a tight loop. Some creates were failing (fix has been backported from 2.0).

The second issue was that it is possible to cause a race in LVM when concurrently deleting logical volumes and volume groups. Added a workaround for that.

Changed in eucalyptus:
status: Invalid → In Progress
Revision history for this message
Neil Soman (neilsoman) wrote :

Also note that any transient failures in deletion are now reported correctly to the user, i.e., a subsequent describe request will list any volumes that could not be deleted for any underlying back-end reason. After this fix, the system should no longer be left in an inconsistent state where the backend/SC thinks volumes are available but the user does not see them. The user has the option of retrying the delete request, as sketched below.
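
Such a user-side retry of stuck deletions could look like this (a sketch; the exact status strings to match may vary, and the field number assumes the VOLUME output format shown earlier in this report):

  # retry deletion for anything the describe request still flags
  euca-describe-volumes | awk '$1 == "VOLUME" && $5 ~ /fail|unavailable/ {print $2}' \
      | xargs -n 1 euca-delete-volume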

Revision history for this message
Neil Soman (neilsoman) wrote :

I've applied another fix (revno 1236, 1.6.2 branch) that makes cleanup of volumes more robust (1235 solves part of the problem). This has been tested on our end.

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Marking fix-released in ubuntu/maverick, as this is fixed in Eucalyptus 2.0.

Changed in eucalyptus (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Assigning this to myself.

I'll get a package built and tested locally, and then pushed to lucid-proposed.

:-Dustin

Changed in eucalyptus (Ubuntu Lucid):
status: Confirmed → Triaged
assignee: nobody → Dustin Kirkland (kirkland)
Changed in jaguarundi:
assignee: nobody → Dustin Kirkland (kirkland)
status: Confirmed → Triaged
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Does anyone on this bug have a current setup affected by this bug that they could try upgrading to a candidate binary package that should fix this problem?

All of my clouds are Maverick at this point. I'm hoping someone here has a Lucid 10.04 cloud...

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

FYI, I just sent a build to the ~ubuntu-virt PPA. It should be building soon. Need someone to sniff test this before we push it to proposed.

Revision history for this message
Torsten Spindler (tspindler) wrote :

I can reproduce the problem on the training cloud:

$ for i in `seq 1 100`; do euca-create-volume -z TorstenCloud -s1; done
...
VOLUME vol-52B905E3 1 creating 2010-09-02T06:27:50.814Z
VOLUME vol-5F290652 1 creating 2010-09-02T06:27:51.142Z
VOLUME vol-598F062A 1 creating 2010-09-02T06:27:51.487Z
Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.
Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.
Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.
...
$ euca-describe-volumes | grep failed
VOLUME vol-52B905E3 1 TorstenCloud failed 2010-09-02T06:27:50.814Z
$ euca-describe-volumes | grep unavailable | wc -l
13
$ euca-describe-volumes | grep available | grep -v unavailable | wc -l
48

I'll add ppa:ubuntu-virt/ppa to the sources.list and report what happens when re-running the test.

Revision history for this message
Neil Soman (neilsoman) wrote :

Make sure you start with a clean installation when you rerun the test. Thanks!

Changed in jaguarundi:
status: Triaged → In Progress
Changed in eucalyptus (Ubuntu Lucid):
status: Triaged → In Progress
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

I spent most of today testing this. My first package (ppa1) was missing the euca_rootwrap configuration changes necessary to allow the new LVM commands used by the upstream package. The ppa2 package I just uploaded adds that and solves this problem.

I'm happy with it so far. It has solved the problem for me, as far as I can tell. I have created/deleted hundreds of volumes now and have not seen the failures I was seeing before.

I'm uploading to lucid-proposed now, and writing up the SRU justification in the description above.

summary: - eucalyptus create and delete volumes sometimes fail on lvm commands (POC
- + ENT configs)
+ [SRU] eucalyptus create and delete volumes sometimes fail on lvm
+ commands (POC + ENT configs)
description: updated
Changed in eucalyptus (Ubuntu Lucid):
milestone: none → ubuntu-10.04.2
Revision history for this message
Colin Watson (cjwatson) wrote : Please test proposed package

Accepted eucalyptus into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in eucalyptus (Ubuntu Lucid):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
C de-Avillez (hggdh2) wrote :

As far as volume creation/deletion goes, it seems to be running like a charm. The only small thing is that when deleting volumes, cloud-debug.log gets at least one error logged for each volume:

17:29:44 ERROR [SystemUtil:pool-10-thread-4] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop151 error: loop: can't get info on device /dev/loop151: No such device or address
com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop151 error: loop: can't get info on device /dev/loop151: No such device or address

Not sure this _is_ a problem, but here it is. Even with these errors being reported, creation/deletion succeeded for 1,586 volumes (and yes, there were 1,586 lines as above printed out in the cloud-debug log).
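
For the record, checking that the error count matches the volume count is a one-liner (assuming the default log location under /var/log/eucalyptus):

  grep -c "can't get info on device" /var/log/eucalyptus/cloud-debug.log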

I am not yet tagging 'verification-done' because I still want to run an instance test.

Revision history for this message
C de-Avillez (hggdh2) wrote :

Oh, forgot, sorry: logs saved as usual:

Using saved push location: bzr+ssh://bazaar.launchpad.net/~hggdh2/%2Bjunk/uec-qa/
Pushed up to revision 52.

Revision history for this message
Neil Soman (neilsoman) wrote :

"17:29:44 ERROR [SystemUtil:pool-10-thread-4] com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop151 error: loop: can't get info on device /dev/loop151: No such device or address
com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap losetup /dev/loop151 error: loop: can't get info on device /dev/loop151: No such device or address"

Carlos, that error is actually not a real "error"; it is unfortunately tagged as such when logging. The Storage Controller is trying to make sure that the loop device exists before trying to use it or remove it, basically the equivalent of a "stat". We can fix the way this is being reported, but it is not critical.

Revision history for this message
C de-Avillez (hggdh2) wrote :

Cool. I could not see it as an error anyway, but it is good to have your confirmation. And I agree, not critical at all. Thank you, Neil.

Revision history for this message
C de-Avillez (hggdh2) wrote :

I consider this 'verification-done', but will not change the tag until OEM QA run their own tests.

Instance run tests results saved:

Using saved push location: bzr+ssh://bazaar.launchpad.net/~hggdh2/%2Bjunk/uec-qa/
Pushed up to revision 53.

Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :

I've applied the fix from http://archive.ubuntu.com/ubuntu/pool/main/e/eucalyptus/eucalyptus-sc_1.6.2-0ubuntu30.4_amd64.deb and unfortunately the creation loop still reports some errors:

Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.

The deletion loop iterates well, but the volumes are still present in /var/lib/eucalyptus/volumes, and a quick 'sudo losetup -a' shows them all.

Changed in eucalyptus (Ubuntu):
status: Fix Released → Confirmed
Changed in jaguarundi:
status: In Progress → Confirmed
Revision history for this message
Neil Soman (neilsoman) wrote :

Sylvain, are you positive that you are running the right version? Did you start from a clean state? Are there any other error messages on the storage controller? Are lvdisplay, vgdisplay, and pvdisplay added to the whitelist of commands?

We have tested this several times on our end and I'm puzzled that you'd still see this, but we'd be happy to take a look.
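
For the whitelist question, a local check along these lines would answer it (a sketch; the wrappers.conf path is an assumption based on the package changelog at the end of this report):

  # verify the newly required LVM commands are whitelisted for euca_rootwrap
  grep -E 'lvdisplay|vgdisplay|pvdisplay' /etc/eucalyptus/wrappers.conf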

Revision history for this message
Dustin Kirkland  (kirkland) wrote : Re: [Bug 590929] Re: [SRU] eucalyptus create and delete volumes sometimes fail on lvm commands (POC + ENT configs)

Sylvain,

Did you install from scratch, or upgrade the broken system?

:-Dustin

Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :

I've upgraded the system to euca* 1.6.2-0ubuntu30.4 (much better than upgrading only the SC). I hope that revs 1235/1236 are in this release; according to the delivery dates they should be.
So the status is better, but I still see some errors while creating volumes in a loop (40):

Volume: Error while communicating with Storage Controller: CreateStorageVolume:Internal Error.

When creating 40 volumes, this error is typically reported only once (at most 3 times).
The deletion loop does not report any errors and a vgscan shows a clean system, great.

If I run the creation loop several times, I see in sc-stats.log that the amount of space used and the number of managed volumes never decrease. Above the limit (Disk space reserved for volumes), the previous error message is returned for each creation request.

The *broken* system was upgraded with the lucid-proposed packages and rebooted before executing the test.

Revision history for this message
C de-Avillez (hggdh2) wrote :

Hi Sylvain,

Can you please attach the /var/log/eucalyptus logs from all the machines showing the error? Also, the exact sequence of commands and their output would help.

Thank you.

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Neil wanted the testing to be done on a perfectly clean, newly installed system.

Neil, what do you make of Sylvain's test procedure?

Revision history for this message
Neil Soman (neilsoman) wrote :

Make sure you increase the disk space reserved for volumes (default is 50G) if you expect volume creation beyond that limit to succeed :)

Also, not sure I read the previous message correctly. There are fixes to the CLC as well as the SC so you will need to upgrade and restart both.

Revision history for this message
mahmoh (mahmoh) wrote :

Proper error messages are getting reported on upgraded systems without the residual artefacts that were reported in the bug: verified, -done.

Changed in jaguarundi:
status: Confirmed → Fix Released
mahmoh (mahmoh)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :

I've performed a full re-installation of the frontend as requested.
Before any tests or registration, I've updated all eucalyptus packages to 1.6.2-0ubuntu30.4.
I've also adjusted cluster1.storage.maxtotalvolumesizeingb in the web front end (set to 250).

I did 10 test loops (as usual creating 40 volumes).

 + No Internal Errors were reported.
 + Space available and # of volumes correctly reported in sc-stats.log
 + Deletion worked perfectly

But I found one (just one) failure reported by euca-describe-volumes, 10 seconds after the creation loop:
VOLUME vol-526005D5 1 cluster1 failed 2010-09-15T17:00:32.7Z

A few seconds later, euca-describe-volumes didn't report anything about this failure, which is weird. But I had only 39 volumes created.

This is what was reported in cloud-error.log:

12:00:34 [SystemUtil:pool-10-thread-4] ERROR com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap vgcreate vg-lp4W3A.. /dev/loop18 error: /var/lock/lvm/P_orphans: flock failed: Resource temporarily unavailable
  Can't get lock for orphan PVs

com.eucalyptus.util.ExecutionException: ///usr/lib/eucalyptus/euca_rootwrap vgcreate vg-lp4W3A.. /dev/loop18 error: /var/lock/lvm/P_orphans: flock failed: Resource temporarily unavailable
  Can't get lock for orphan PVs

        at edu.ucsb.eucalyptus.util.SystemUtil.run(SystemUtil.java:91)
        at edu.ucsb.eucalyptus.storage.LVM2Manager.createVolumeGroup(LVM2Manager.java:170)
        at edu.ucsb.eucalyptus.storage.LVM2Manager.createLogicalVolume(LVM2Manager.java:430)
        at edu.ucsb.eucalyptus.storage.LVM2Manager.createVolume(LVM2Manager.java:476)
        at edu.ucsb.eucalyptus.cloud.ws.BlockStorage$VolumeCreator.run(BlockStorage.java:798)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
12:00:34 [BlockStorage:pool-10-thread-4] ERROR com.eucalyptus.util.EucalyptusCloudException: Unable to create volume group vg-lp4W3A.. for /dev/loop18

Changed in eucalyptus (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eucalyptus - 1.6.2-0ubuntu30.4

---------------
eucalyptus (1.6.2-0ubuntu30.4) lucid-proposed; urgency=low

  * debian/patches/590929.diff: cherry pick upstream bzr r1235..1236 to
    fix LP: #590929; volume creation/deletion issues; add newly used
    LVM commands to wrappers.conf
 -- Dustin Kirkland <email address hidden> Wed, 01 Sep 2010 17:33:09 -0600

Changed in eucalyptus (Ubuntu Lucid):
status: Fix Committed → Fix Released
Andy Grimm (agrimm)
Changed in eucalyptus:
status: Fix Committed → Fix Released