Merge lp:~chad.smith/charms/precise/block-storage-broker/bsb-retries-on-volume-create-and-attach into lp:charms/block-storage-broker

Proposed by Chad Smith on 2014-08-22
Status: Merged
Merged at revision: 56
Proposed branch: lp:~chad.smith/charms/precise/block-storage-broker/bsb-retries-on-volume-create-and-attach
Merge into: lp:charms/block-storage-broker
Diff against target: 217 lines (+78/-34)
3 files modified
hooks/test_hooks.py (+4/-4)
hooks/test_util.py (+38/-13)
hooks/util.py (+36/-17)
To merge this branch: bzr merge lp:~chad.smith/charms/precise/block-storage-broker/bsb-retries-on-volume-create-and-attach
Reviewer Review Type Date Requested Status
Chad Smith (community) Approve on 2014-09-09
David Britton (community) 2014-08-22 Needs Fixing on 2014-09-09
Paul Larson 2014-08-22 Pending
Review via email: mp+231974@code.launchpad.net

Description of the change

To attempt to avoid spurious errors from openstack on nova volume-create and volume-attach commands, this branch adds a retry mechanism for those two commands run by the block storage broker.

This branch introduces an internal _run_command method to reunse retry and error handling logic around commands that are known to produce intermittent 500 errors.

_run_command attempts 3 retries on command failure and logs WARNINGs with each failed attempt. If the command fails 4 times in a row an ERROR is logged and we exit(1) to break the hook.

To keep this merge proposal smaller, we only use _run_command is only being used for "nova volume-create" and "nova volume-attach" methods. Subsequent branches will pull in other nova and euca2ools commands as we refine the error handling and retries needed.

To post a comment you must log in.
David Britton (davidpbritton) wrote :

[0] in _def_run_command(), why split the output into lines in a list? I think doing an rstrip() on the result would be enough..

David Britton (davidpbritton) wrote :

[1] I would increase the retry time to 30s, 5s seems too frequent.

Just two minor thing to fix. I tested both the unit tests and deploying on openstack, all worked great.

Thanks for the contribution. Hopefully this will make it more stable. If not, we can easily increase the retry count as well.

I like it.

review: Needs Fixing
Paul Larson (pwlars) wrote :

This is working better for me. I don't get the HTTP 500 errors anymore, but I still occasionally hit the other problem I described where it tries to create a new volume when the existing one is already there.

Chad Smith (chad.smith) wrote :

> [0] in _def_run_command(), why split the output into lines in a list? I think
> doing an rstrip() on the result would be enough..

The output was being split into lines to pre-enable a follow-on branch which would use _run_command for our "nova volume-show" calls. This volume-show has multi-line output and splitting it in the _run_command call would make parsing of that output a bit simpler. I'll pull it out of this branch though as nothing "currently" uses that functionality.

Chad Smith (chad.smith) wrote :

> This is working better for me. I don't get the HTTP 500 errors anymore, but I
> still occasionally hit the other problem I described where it tries to create
> a new volume when the existing one is already there.

Thanks Paul, I think we might have to open a separate bug for that duplicate volume created. I'm trying to understand your redeployment steps so I can see this error with more debug statements added.

It sounded like you deployed successfully once, then tore down your environment or the units, and then redeployed. Did you juju destroy-environment or juju destroy-service postgresql? or just remove-unit postgresql/0?

Thanks for the additional info.

review: Approve
David Britton (davidpbritton) wrote :

Hi Chad -- Thanks for doing this.

Please resolve the merge conflicts and I'll commit up for both trusty and precise.

review: Needs Fixing
David Britton (davidpbritton) wrote :

@Paul, @Chad, Please follow on with a separate bug for the other issue you are seeing, Thanks!

Chad Smith (chad.smith) wrote :

Just pulled trunk and resolved the test conflict. changes are now pushed thanks David.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'hooks/test_hooks.py'
2--- hooks/test_hooks.py 2014-08-22 08:05:32 +0000
3+++ hooks/test_hooks.py 2014-09-09 16:24:59 +0000
4@@ -182,11 +182,11 @@
5 self.mocker.replay()
6 hooks.config_changed()
7
8- def test_install_installs_novaclient_without_cloud_archive(self):
9+ def test_install_installs_novaclient_and_no_cloud_archive_on_trusty(self):
10 """
11- On releases C{trusty} and later L{install} will install the
12- python-novaclient package without installing the cloud-archive
13- repository.
14+ On trusty, 14.04, and later, L{install} will not call
15+ C{fetch.add_source} to add a cloud repository but it will install the
16+ install the C{python-novaclient} package.
17 """
18 get_running_series = self.mocker.replace(hooks.get_running_series)
19 get_running_series()
20
21=== modified file 'hooks/test_util.py'
22--- hooks/test_util.py 2014-08-22 13:41:59 +0000
23+++ hooks/test_util.py 2014-09-09 16:24:59 +0000
24@@ -482,6 +482,7 @@
25 result = self.assertRaises(
26 SystemExit, self.storage._nova_describe_volumes)
27 self.assertEqual(result.code, 1)
28+
29 message = (
30 "ERROR: Command '%s' returned non-zero exit status 1" % command)
31 self.assertIn(
32@@ -632,7 +633,7 @@
33 command = (
34 "nova volume-create --display-name '%s' %s" % (volume_label, size))
35
36- create = self.mocker.replace(subprocess.check_call)
37+ create = self.mocker.replace(subprocess.check_output)
38 create(command, shell=True)
39 self.mocker.replay()
40
41@@ -658,7 +659,7 @@
42 command = (
43 "nova volume-create --display-name '%s' %s" % (volume_label, size))
44
45- create = self.mocker.replace(subprocess.check_call)
46+ create = self.mocker.replace(subprocess.check_output)
47 create(command, shell=True)
48 self.mocker.replay()
49 self.storage.get_volume_id = lambda label: None
50@@ -673,10 +674,11 @@
51 self.assertIn(
52 message, util.hookenv._log_ERROR, "Not logged- %s" % message)
53
54- def test_wb_nova_create_volume_error_command_failed(self):
55+ def test_wb_nova_create_volume_error_command_failed_with_retries(self):
56 """
57- L{_nova_create_volume} will log an error and exit when
58- C{nova create-volume} command fails.
59+ L{_nova_create_volume} will log warnings and retry 3 times when an
60+ error is raised by the command C{nova create-volume}. Upon failure of
61+ the third retry, L{_nova_create_volume} will log an error and exit 1.
62 """
63 instance_id = "i-123123"
64 volume_label = "postgresql/0 unit volume"
65@@ -684,17 +686,28 @@
66 command = (
67 "nova volume-create --display-name '%s' %s" % (volume_label, size))
68
69- create = self.mocker.replace(subprocess.check_call)
70+ create = self.mocker.replace(subprocess.check_output)
71 create(command, shell=True)
72+ self.mocker.count(4)
73 self.mocker.throw(subprocess.CalledProcessError(1, command))
74+ sleep = self.mocker.replace("time.sleep")
75+ sleep(30)
76+ self.mocker.count(3)
77 self.mocker.replay()
78
79 result = self.assertRaises(
80 SystemExit, self.storage._nova_create_volume, size, volume_label,
81 instance_id)
82 self.assertEqual(result.code, 1)
83- message = (
84- "ERROR: Command '%s' returned non-zero exit status 1" % command)
85+ self.assertEqual(len(util.hookenv._log_WARNING), 3)
86+ message = (
87+ "WARNING: Command '%s' returned non-zero exit status 1. "
88+ "Retrying 3 more times" % command)
89+ self.assertIn(
90+ message, util.hookenv._log_WARNING, "Not logged- %s" % message)
91+
92+ message = (
93+ "ERROR: Command '%s' returned non-zero exit status 1." % command)
94 self.assertIn(
95 message, util.hookenv._log_ERROR, "Not logged- %s" % message)
96
97@@ -738,10 +751,12 @@
98 self.storage._nova_attach_volume(instance_id, volume_id),
99 "")
100
101- def test_wb_nova_attach_volume_command_error(self):
102+ def test_wb_nova_attach_volume_command_error_retries_three_times(self):
103 """
104- L{_nova_attach_volume} will exit in error when the
105- C{nova volume-attach} command fails.
106+ L{_nova_attach_volume} will warn on command error and retry the command
107+ 3 times with sleeps in between. When the command C{nova volume-attach}
108+ exits in error on the third retry, an error is logged and the method
109+ exits 1.
110 """
111 instance_id = "i-123123"
112 volume_id = "123-123-123"
113@@ -750,15 +765,25 @@
114 (instance_id, volume_id))
115 attach = self.mocker.replace(subprocess.check_output)
116 attach(command, shell=True)
117+ self.mocker.count(4)
118 self.mocker.throw(subprocess.CalledProcessError(1, command))
119+ sleep = self.mocker.replace("time.sleep")
120+ sleep(30)
121+ self.mocker.count(3)
122 self.mocker.replay()
123
124 result = self.assertRaises(
125 SystemExit, self.storage._nova_attach_volume, instance_id,
126 volume_id)
127 self.assertEqual(result.code, 1)
128- message = (
129- "ERROR: Command '%s' returned non-zero exit status 1" % command)
130+ self.assertEqual(len(util.hookenv._log_WARNING), 3)
131+ message = (
132+ "WARNING: Command '%s' returned non-zero exit status 1. "
133+ "Retrying 3 more times" % command)
134+ self.assertIn(
135+ message, util.hookenv._log_WARNING, "Not logged- %s" % message)
136+ message = (
137+ "ERROR: Command '%s' returned non-zero exit status 1." % command)
138 self.assertIn(
139 message, util.hookenv._log_ERROR, "Not logged- %s" % message)
140
141
142=== modified file 'hooks/util.py'
143--- hooks/util.py 2014-08-22 07:55:38 +0000
144+++ hooks/util.py 2014-09-09 16:24:59 +0000
145@@ -43,6 +43,35 @@
146 self.required_config_options = REQUIRED_CONFIG_OPTIONS[provider]
147 self.ec2_conn = None
148
149+ def _run_command(self, command, retries=0):
150+ """Run the provided command and return output stripped of whitespace.
151+
152+ On command failed, retry the provided number of C{retries} or exit(1)
153+ with an error message.
154+ """
155+ command_failed = True
156+ for x in range(1 + retries):
157+ try:
158+ output = subprocess.check_output(command, shell=True)
159+ except subprocess.CalledProcessError, e:
160+ remaining_retries = retries - x
161+ if remaining_retries:
162+ message = (
163+ "WARNING: %s. Retrying %d more times" %
164+ (str(e), remaining_retries))
165+ hookenv.log(message, hookenv.WARNING)
166+ sleep(30)
167+ else:
168+ command_failed = False
169+ break
170+ if command_failed:
171+ hookenv.log("ERROR: %s." % str(e), hookenv.ERROR)
172+ sys.exit(1)
173+
174+ if output:
175+ return output.rstrip()
176+ return ""
177+
178 def load_environment(self):
179 """
180 Source our credentials from the configuration definitions into our
181@@ -466,29 +495,19 @@
182 Attach a Nova C{volume_id} to the provided C{instance_id} and return
183 the device path.
184 """
185- try:
186- device = subprocess.check_output(
187- "nova volume-attach %s %s auto | egrep -o \"/dev/vd[b-z]\"" %
188- (instance_id, volume_id), shell=True)
189- except subprocess.CalledProcessError, e:
190- hookenv.log("ERROR: %s" % str(e), hookenv.ERROR)
191- sys.exit(1)
192- if device.strip():
193- return device.strip()
194- return ""
195+ device = self._run_command(
196+ "nova volume-attach %s %s auto | egrep -o \"/dev/vd[b-z]\"" %
197+ (instance_id, volume_id), retries=3)
198+ return device
199
200 def _nova_create_volume(self, size, volume_label, instance_id):
201 """Create an Nova volume with a specific C{size} and C{volume_label}"""
202 hookenv.log(
203 "Creating a %sGig volume named (%s) for instance %s" %
204 (size, volume_label, instance_id))
205- try:
206- subprocess.check_call(
207- "nova volume-create --display-name '%s' %s" %
208- (volume_label, size), shell=True)
209- except subprocess.CalledProcessError, e:
210- hookenv.log("ERROR: %s" % str(e), hookenv.ERROR)
211- sys.exit(1)
212+ self._run_command(
213+ "nova volume-create --display-name '%s' %s" %
214+ (volume_label, size), retries=3)
215
216 volume_id = self.get_volume_id(volume_label)
217 if not volume_id:

Subscribers

People subscribed via source and target branches

to all changes: