Juju Charms Collection
block-storage-broker package

Merge lp:~chad.smith/charms/precise/block-storage-broker/bsb-retries-on-volume-create-and-attach into lp:charms/block-storage-broker

Proposed by Chad Smith on 2014-08-22

Status:

Merged

Merged at revision:

Proposed branch:

lp:~chad.smith/charms/precise/block-storage-broker/bsb-retries-on-volume-create-and-attach

Merge into:

lp:charms/block-storage-broker

Diff against target:

217 lines (+78/-34)

3 files modified

hooks/test_hooks.py (+4/-4)
hooks/test_util.py (+38/-13)
hooks/util.py (+36/-17)

To merge this branch:

bzr merge lp:~chad.smith/charms/precise/block-storage-broker/bsb-retries-on-volume-create-and-attach

High

Fix Committed

Reviewer	Date Requested	Status
Chad Smith (community)		Approve on 2014-09-09
David Britton (community)	2014-08-22	Needs Fixing on 2014-09-09
Paul Larson	2014-08-22	Pending
Review via email: mp+231974@code.launchpad.net

Description of the change

To work around spurious errors from OpenStack on the nova volume-create and volume-attach commands, this branch adds a retry mechanism for those two commands as run by the block storage broker.

This branch introduces an internal _run_command method to reuse retry and error handling logic around commands that are known to produce intermittent 500 errors.

_run_command retries a failed command up to 3 times, logging a WARNING before each retry. If the command fails 4 times in a row, an ERROR is logged and we exit(1) to break the hook.
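
As a rough standalone illustration of that retry behaviour (a sketch only, not the charm's code): the function below is written in Python 3 syntax, whereas the charm targets Python 2, and the names run_command, delay and sleep are assumptions introduced here so the example can run without hookenv or real waiting.

```python
import subprocess
import sys
import time


def run_command(command, retries=0, delay=30, sleep=time.sleep):
    """Run a shell command and return its output stripped of trailing whitespace.

    On failure, retry up to `retries` more times, then exit(1).
    `delay` and `sleep` are hypothetical parameters added for this sketch.
    """
    for attempt in range(1 + retries):
        try:
            output = subprocess.check_output(command, shell=True)
        except subprocess.CalledProcessError as error:
            remaining = retries - attempt
            if remaining == 0:
                # Every attempt failed: log an error and break the hook.
                print("ERROR: %s." % error, file=sys.stderr)
                sys.exit(1)
            # Warn, wait, and go around the loop for the next attempt.
            print(
                "WARNING: %s. Retrying %d more times" % (error, remaining),
                file=sys.stderr)
            sleep(delay)
        else:
            return output.decode().rstrip()
```

With retries=3 a command that always fails is run 4 times in total, with 3 warnings and 3 sleeps before the final error.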

To keep this merge proposal small, _run_command is used only for the "nova volume-create" and "nova volume-attach" commands. Subsequent branches will pull in other nova and euca2ools commands as we refine the error handling and retries needed.

David Britton (dpb) wrote on 2014-08-23:

[0] in _run_command(), why split the output into a list of lines? I think doing an rstrip() on the result would be enough.

David Britton (dpb) wrote on 2014-08-25:

[1] I would increase the retry time to 30s, 5s seems too frequent.

Just two minor things to fix. I tested both the unit tests and deploying on OpenStack; all worked great.

Thanks for the contribution. Hopefully this will make it more stable. If not, we can easily increase the retry count as well.

I like it.

review: Needs Fixing

Paul Larson (pwlars) wrote on 2014-08-26:

This is working better for me. I don't get the HTTP 500 errors anymore, but I still occasionally hit the other problem I described where it tries to create a new volume when the existing one is already there.

Chad Smith (chad.smith) wrote on 2014-08-26:

> [0] in _def_run_command(), why split the output into lines in a list? I think
> doing an rstrip() on the result would be enough..

The output was being split into lines to pre-enable a follow-on branch which would use _run_command for our "nova volume-show" calls. This volume-show has multi-line output and splitting it in the _run_command call would make parsing of that output a bit simpler. I'll pull it out of this branch though as nothing "currently" uses that functionality.
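
To illustrate why line-oriented output could help with a multi-line command, here is a small sketch that parses a two-column table into a dict. The sample text is hypothetical and is not captured from a real "nova volume-show" run, and parse_table is a name invented for this example, not something in this branch.

```python
# Hypothetical sample of two-column table output; not real command output.
SAMPLE_OUTPUT = """\
+----------+-----------+
| Property | Value     |
+----------+-----------+
| id       | 123-123   |
| status   | available |
+----------+-----------+
"""


def parse_table(output):
    """Return a dict of property -> value parsed from table-style output."""
    fields = {}
    for line in output.rstrip().splitlines():
        if not line.startswith("|"):
            continue  # skip the +----+ border rows
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2 and cells[0] != "Property":
            fields[cells[0]] = cells[1]
    return fields
```

For single-line output such as the device path from volume-attach, a plain rstrip() on the result is enough, which is what this branch now does.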

Chad Smith (chad.smith) wrote on 2014-08-26:

> This is working better for me. I don't get the HTTP 500 errors anymore, but I
> still occasionally hit the other problem I described where it tries to create
> a new volume when the existing one is already there.

Thanks Paul, I think we might have to open a separate bug for that duplicate volume creation. I'm trying to understand your redeployment steps so I can reproduce this error with more debug statements added.

It sounded like you deployed successfully once, then tore down your environment or the units, and then redeployed. Did you run juju destroy-environment, juju destroy-service postgresql, or just remove-unit postgresql/0?

Thanks for the additional info.

David Britton (dpb) on 2014-08-26:

review: Approve

David Britton (dpb) wrote on 2014-09-09:

Hi Chad -- Thanks for doing this.

Please resolve the merge conflicts and I'll commit up for both trusty and precise.

review: Needs Fixing

David Britton (dpb) wrote on 2014-09-09:

@Paul, @Chad, please follow up with a separate bug for the other issue you are seeing. Thanks!

Chad Smith (chad.smith) wrote on 2014-09-09:

Just pulled trunk and resolved the test conflict. Changes are now pushed. Thanks, David.

review: Approve

Preview Diff

 === modified file 'hooks/test_hooks.py'
 --- hooks/test_hooks.py	2014-08-22 08:05:32 +0000
 +++ hooks/test_hooks.py	2014-09-09 16:24:59 +0000
@@ -182,11 +182,11 @@
          self.mocker.replay()
          hooks.config_changed()
--    def test_install_installs_novaclient_without_cloud_archive(self):
++    def test_install_installs_novaclient_and_no_cloud_archive_on_trusty(self):
          """
--        On releases C{trusty} and later L{install} will install the
--        python-novaclient package without installing the cloud-archive
--        repository.
++        On trusty, 14.04, and later, L{install} will not call
++        C{fetch.add_source} to add a cloud repository but it will install the
++        install the C{python-novaclient} package.
          """
          get_running_series = self.mocker.replace(hooks.get_running_series)
          get_running_series()
 === modified file 'hooks/test_util.py'
 --- hooks/test_util.py	2014-08-22 13:41:59 +0000
 +++ hooks/test_util.py	2014-09-09 16:24:59 +0000
@@ -482,6 +482,7 @@
          result = self.assertRaises(
              SystemExit, self.storage._nova_describe_volumes)
          self.assertEqual(result.code, 1)
++
          message = (
              "ERROR: Command '%s' returned non-zero exit status 1" % command)
          self.assertIn(
@@ -632,7 +633,7 @@
          command = (
              "nova volume-create --display-name '%s' %s" % (volume_label, size))
--        create = self.mocker.replace(subprocess.check_call)
++        create = self.mocker.replace(subprocess.check_output)
          create(command, shell=True)
          self.mocker.replay()
@@ -658,7 +659,7 @@
          command = (
              "nova volume-create --display-name '%s' %s" % (volume_label, size))
--        create = self.mocker.replace(subprocess.check_call)
++        create = self.mocker.replace(subprocess.check_output)
          create(command, shell=True)
          self.mocker.replay()
          self.storage.get_volume_id = lambda label: None
@@ -673,10 +674,11 @@
          self.assertIn(
              message, util.hookenv._log_ERROR, "Not logged- %s" % message)
--    def test_wb_nova_create_volume_error_command_failed(self):
++    def test_wb_nova_create_volume_error_command_failed_with_retries(self):
          """
--        L{_nova_create_volume} will log an error and exit when
--        C{nova create-volume} command fails.
++        L{_nova_create_volume} will log warnings and retry 3 times when an
++        error is raised by the command  C{nova create-volume}. Upon failure of
++        the third retry, L{_nova_create_volume} will log an error and exit 1.
          """
          instance_id = "i-123123"
          volume_label = "postgresql/0 unit volume"
@@ -684,17 +686,28 @@
          command = (
              "nova volume-create --display-name '%s' %s" % (volume_label, size))
--        create = self.mocker.replace(subprocess.check_call)
++        create = self.mocker.replace(subprocess.check_output)
          create(command, shell=True)
++        self.mocker.count(4)
          self.mocker.throw(subprocess.CalledProcessError(1, command))
++        sleep = self.mocker.replace("time.sleep")
++        sleep(30)
++        self.mocker.count(3)
          self.mocker.replay()
          result = self.assertRaises(
              SystemExit, self.storage._nova_create_volume, size, volume_label,
              instance_id)
          self.assertEqual(result.code, 1)
--        message = (
--            "ERROR: Command '%s' returned non-zero exit status 1" % command)
++        self.assertEqual(len(util.hookenv._log_WARNING), 3)
++        message = (
++            "WARNING: Command '%s' returned non-zero exit status 1. "
++            "Retrying 3 more times" % command)
++        self.assertIn(
++            message, util.hookenv._log_WARNING, "Not logged- %s" % message)
++
++        message = (
++            "ERROR: Command '%s' returned non-zero exit status 1." % command)
          self.assertIn(
              message, util.hookenv._log_ERROR, "Not logged- %s" % message)
@@ -738,10 +751,12 @@
              self.storage._nova_attach_volume(instance_id, volume_id),
              "")
--    def test_wb_nova_attach_volume_command_error(self):
++    def test_wb_nova_attach_volume_command_error_retries_three_times(self):
          """
--        L{_nova_attach_volume} will exit in error when the
--        C{nova volume-attach} command fails.
++        L{_nova_attach_volume} will warn on command error and retry the command
++        3 times with sleeps in between. When the command C{nova volume-attach}
++        exits in error on the third retry, an error is logged and the method
++        exits 1.
          """
          instance_id = "i-123123"
          volume_id = "123-123-123"
@@ -750,15 +765,25 @@
              (instance_id, volume_id))
          attach = self.mocker.replace(subprocess.check_output)
          attach(command, shell=True)
++        self.mocker.count(4)
          self.mocker.throw(subprocess.CalledProcessError(1, command))
++        sleep = self.mocker.replace("time.sleep")
++        sleep(30)
++        self.mocker.count(3)
          self.mocker.replay()
          result = self.assertRaises(
              SystemExit, self.storage._nova_attach_volume, instance_id,
              volume_id)
          self.assertEqual(result.code, 1)
--        message = (
--            "ERROR: Command '%s' returned non-zero exit status 1" % command)
++        self.assertEqual(len(util.hookenv._log_WARNING), 3)
++        message = (
++            "WARNING: Command '%s' returned non-zero exit status 1. "
++            "Retrying 3 more times" % command)
++        self.assertIn(
++            message, util.hookenv._log_WARNING, "Not logged- %s" % message)
++        message = (
++            "ERROR: Command '%s' returned non-zero exit status 1." % command)
          self.assertIn(
              message, util.hookenv._log_ERROR, "Not logged- %s" % message)
 === modified file 'hooks/util.py'
 --- hooks/util.py	2014-08-22 07:55:38 +0000
 +++ hooks/util.py	2014-09-09 16:24:59 +0000
@@ -43,6 +43,35 @@
          self.required_config_options = REQUIRED_CONFIG_OPTIONS[provider]
          self.ec2_conn = None
++    def _run_command(self, command, retries=0):
++        """Run the provided command and return output stripped of whitespace.
++
++        On command failed, retry the provided number of C{retries} or exit(1)
++        with an error message.
++        """
++        command_failed = True
++        for x in range(1 + retries):
++            try:
++                output = subprocess.check_output(command, shell=True)
++            except subprocess.CalledProcessError, e:
++                remaining_retries = retries - x
++                if remaining_retries:
++                    message = (
++                        "WARNING: %s. Retrying %d more times" %
++                        (str(e), remaining_retries))
++                    hookenv.log(message, hookenv.WARNING)
++                    sleep(30)
++            else:
++                command_failed = False
++                break
++        if command_failed:
++            hookenv.log("ERROR: %s." % str(e), hookenv.ERROR)
++            sys.exit(1)
++
++        if output:
++            return output.rstrip()
++        return ""
++
      def load_environment(self):
          """
          Source our credentials from the configuration definitions into our
@@ -466,29 +495,19 @@
          Attach a Nova C{volume_id} to the provided C{instance_id} and return
          the device path.
          """
--        try:
--            device = subprocess.check_output(
--                "nova volume-attach %s %s auto | egrep -o \"/dev/vd[b-z]\"" %
--                (instance_id, volume_id), shell=True)
--        except subprocess.CalledProcessError, e:
--            hookenv.log("ERROR: %s" % str(e), hookenv.ERROR)
--            sys.exit(1)
--        if device.strip():
--            return device.strip()
--        return ""
++        device = self._run_command(
++            "nova volume-attach %s %s auto | egrep -o \"/dev/vd[b-z]\"" %
++            (instance_id, volume_id), retries=3)
++        return device
      def _nova_create_volume(self, size, volume_label, instance_id):
          """Create an Nova volume with a specific C{size} and C{volume_label}"""
          hookenv.log(
              "Creating a %sGig volume named (%s) for instance %s" %
              (size, volume_label, instance_id))
--        try:
--            subprocess.check_call(
--                "nova volume-create --display-name '%s' %s" %
--                (volume_label, size), shell=True)
--        except subprocess.CalledProcessError, e:
--            hookenv.log("ERROR: %s" % str(e), hookenv.ERROR)
--            sys.exit(1)
++        self._run_command(
++            "nova volume-create --display-name '%s' %s" %
++            (volume_label, size), retries=3)
          volume_id = self.get_volume_id(volume_label)
          if not volume_id:
