Merge lp:~xfactor973/charms/trusty/ceph-osd/coordinated-upgrade into lp:~openstack-charmers-archive/charms/trusty/ceph-osd/next
Status: Needs review

Proposed branch: lp:~xfactor973/charms/trusty/ceph-osd/coordinated-upgrade
Merge into: lp:~openstack-charmers-archive/charms/trusty/ceph-osd/next
Diff against target: 600 lines (+387/-18), 5 files modified:
  .bzrignore (+1/-0)
  hooks/ceph.py (+155/-9)
  hooks/ceph_hooks.py (+199/-5)
  hooks/utils.py (+31/-3)
  templates/ceph.conf (+1/-1)
To merge this branch: bzr merge lp:~xfactor973/charms/trusty/ceph-osd/coordinated-upgrade
Related bugs: none
Reviewer | Review Type | Date Requested | Status
---|---|---|---
James Page | | | Needs Fixing
Chris MacNaughton | | | Pending

Review via email: mp+287376@code.launchpad.net
Commit message
Description of the change
This patch allows the Ceph OSD cluster to upgrade itself one node at a time. It does this by using the Ceph monitor cluster as a locking mechanism. There are most likely edge cases with this method that I haven't thought of, so consider this code lightly tested. It worked fine on EC2.
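The coordination described above can be sketched in a few lines: each OSD unit hashes its hostname, the hashes are sorted cluster-wide, and a unit's position in that sorted order decides when it may upgrade (position 0 rolls first; everyone else waits on the unit before them). This is a minimal sketch of the ordering logic only; the hostnames are made up, and the real charm additionally records `_start`/`_done` keys on the monitors.

```python
import hashlib

def upgrade_position(my_hostname, all_hostnames):
    """Slot of this unit in the cluster-wide upgrade order.

    Every unit computes the same sorted list of hostname hashes, so
    all units agree on who rolls first without extra messaging.
    """
    hashed = sorted(hashlib.sha224(h.encode('utf-8')).hexdigest()
                    for h in all_hostnames)
    my_hash = hashlib.sha224(my_hostname.encode('utf-8')).hexdigest()
    return hashed.index(my_hash)

# Illustrative hostnames, not from the proposal
hosts = ['osd-a', 'osd-b', 'osd-c']
positions = {h: upgrade_position(h, hosts) for h in hosts}
print(positions)  # each host gets a distinct slot in 0..2
```

Because the order is derived from hashes rather than unit numbers, it is stable across providers (the rationale for revision 69's switch from IP addresses to hostnames).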
uosci-testing-bot (uosci-testing-bot) wrote:
charm_lint_check #1557 ceph-osd-next for xfactor973 mp287376
LINT FAIL: lint-test failed
LINT Results (max last 2 lines):
make: *** [lint] Error 1
ERROR:root:Make target returned non-zero.
Full lint test output: http://
Build: http://
uosci-testing-bot (uosci-testing-bot) wrote:
charm_unit_test #1305 ceph-osd-next for xfactor973 mp287376
UNIT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote:
charm_amulet_test #554 ceph-osd-next for xfactor973 mp287376
AMULET FAIL: amulet-test failed
AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.
Full amulet test output: http://
Build: http://
uosci-testing-bot (uosci-testing-bot) wrote:
charm_unit_test #1308 ceph-osd-next for xfactor973 mp287376
UNIT OK: passed
uosci-testing-bot (uosci-testing-bot) wrote:
charm_lint_check #1562 ceph-osd-next for xfactor973 mp287376
LINT FAIL: lint-test failed
LINT Results (max last 2 lines):
make: *** [lint] Error 1
ERROR:root:Make target returned non-zero.
Full lint test output: http://
Build: http://
uosci-testing-bot (uosci-testing-bot) wrote:
charm_amulet_test #558 ceph-osd-next for xfactor973 mp287376
AMULET FAIL: amulet-test failed
AMULET Results (max last 2 lines):
make: *** [functional_test] Error 1
ERROR:root:Make target returned non-zero.
Full amulet test output: http://
Build: http://
James Page (james-page) wrote:
I think most of my comments on the ceph-mon proposal also apply here.
Unmerged revisions

- 70. By Chris Holcombe: Add back in monitor pieces. Will separate out into another MP.
- 69. By Chris Holcombe: Hash the hostname instead of the IP address. That is more portable. Works now on lxc and also on ec2.
- 68. By Chris Holcombe: Merge upstream.
- 67. By Chris Holcombe: It rolls! This now upgrades and rolls the ceph osd cluster one by one.
Preview Diff
1 | === modified file '.bzrignore' | |||
2 | --- .bzrignore 2015-10-30 02:23:36 +0000 | |||
3 | +++ .bzrignore 2016-03-01 16:57:01 +0000 | |||
4 | @@ -3,3 +3,4 @@ | |||
5 | 3 | .tox | 3 | .tox |
6 | 4 | .testrepository | 4 | .testrepository |
7 | 5 | bin | 5 | bin |
8 | 6 | .idea | ||
9 | 6 | 7 | ||
10 | === modified file 'hooks/ceph.py' | |||
11 | --- hooks/ceph.py 2016-01-29 07:31:13 +0000 | |||
12 | +++ hooks/ceph.py 2016-03-01 16:57:01 +0000 | |||
13 | @@ -1,4 +1,3 @@ | |||
14 | 1 | |||
15 | 2 | # | 1 | # |
16 | 3 | # Copyright 2012 Canonical Ltd. | 2 | # Copyright 2012 Canonical Ltd. |
17 | 4 | # | 3 | # |
18 | @@ -6,20 +5,19 @@ | |||
19 | 6 | # James Page <james.page@canonical.com> | 5 | # James Page <james.page@canonical.com> |
20 | 7 | # Paul Collins <paul.collins@canonical.com> | 6 | # Paul Collins <paul.collins@canonical.com> |
21 | 8 | # | 7 | # |
22 | 9 | |||
23 | 10 | import json | 8 | import json |
24 | 11 | import subprocess | 9 | import subprocess |
25 | 12 | import time | 10 | import time |
26 | 13 | import os | 11 | import os |
27 | 14 | import re | 12 | import re |
28 | 15 | import sys | 13 | import sys |
29 | 14 | import errno | ||
30 | 16 | from charmhelpers.core.host import ( | 15 | from charmhelpers.core.host import ( |
31 | 17 | mkdir, | 16 | mkdir, |
32 | 18 | chownr, | 17 | chownr, |
33 | 19 | service_restart, | ||
34 | 20 | cmp_pkgrevno, | 18 | cmp_pkgrevno, |
37 | 21 | lsb_release | 19 | lsb_release, |
38 | 22 | ) | 20 | service_restart) |
39 | 23 | from charmhelpers.core.hookenv import ( | 21 | from charmhelpers.core.hookenv import ( |
40 | 24 | log, | 22 | log, |
41 | 25 | ERROR, | 23 | ERROR, |
42 | @@ -54,6 +52,137 @@ | |||
43 | 54 | return "root" | 52 | return "root" |
44 | 55 | 53 | ||
45 | 56 | 54 | ||
46 | 55 | class CrushLocation(object): | ||
47 | 56 | def __init__(self, name, identifier, host, rack, row, datacenter, chassis, root): | ||
48 | 57 | self.name = name | ||
49 | 58 | self.identifier = identifier | ||
50 | 59 | self.host = host | ||
51 | 60 | self.rack = rack | ||
52 | 61 | self.row = row | ||
53 | 62 | self.datacenter = datacenter | ||
54 | 63 | self.chassis = chassis | ||
55 | 64 | self.root = root | ||
56 | 65 | |||
57 | 66 | |||
58 | 67 | """ | ||
59 | 68 | {"nodes":[{"id":-1,"name":"default","type":"root","type_id":10,"children":[-4,-3,-2]},{"id":-2,"name":"ip-172-31-10-122","type":"host","type_id":1,"children":[0]},{"id":0,"name":"osd.0","exists":1,"type":"osd","type_id":0,"status":"up","reweight":"1.000000","crush_weight":"1.000000","depth":2},{"id":-3,"name":"ip-172-31-25-187","type":"host","type_id":1,"children":[1]},{"id":1,"name":"osd.1","exists":1,"type":"osd","type_id":0,"status":"up","reweight":"1.000000","crush_weight":"1.000000","depth":2},{"id":-4,"name":"ip-172-31-38-24","type":"host","type_id":1,"children":[2]},{"id":2,"name":"osd.2","exists":1,"type":"osd","type_id":0,"status":"up","reweight":"1.000000","crush_weight":"1.000000","depth":2}],"stray":[]} | ||
60 | 69 | """ | ||
61 | 70 | |||
62 | 71 | |||
63 | 72 | def get_osd_tree(): | ||
64 | 73 | """ | ||
65 | 74 | Returns the current osd map in JSON. | ||
66 | 75 | :return: JSON String. :raise: ValueError if the monmap fails to parse. | ||
67 | 76 | Also raises CalledProcessError if our ceph command fails | ||
68 | 77 | """ | ||
69 | 78 | try: | ||
70 | 79 | tree = subprocess.check_output( | ||
71 | 80 | ['sudo', '-u', ceph_user(), | ||
72 | 81 | 'ceph', 'osd', 'tree', '--format=json']) | ||
73 | 82 | try: | ||
74 | 83 | json_tree = json.loads(tree) | ||
75 | 84 | crush_list = [] | ||
76 | 85 | # Make sure children are present in the json | ||
77 | 86 | if not json_tree['nodes']: | ||
78 | 87 | return None | ||
79 | 88 | child_ids = json_tree['nodes'][0]['children'] | ||
80 | 89 | for child in json_tree['nodes']: | ||
81 | 90 | if child['id'] in child_ids: | ||
82 | 91 | crush_list.append( | ||
83 | 92 | CrushLocation( | ||
84 | 93 | name=child.get('name'), | ||
85 | 94 | identifier=child['id'], | ||
86 | 95 | host=child.get('host'), | ||
87 | 96 | rack=child.get('rack'), | ||
88 | 97 | row=child.get('row'), | ||
89 | 98 | datacenter=child.get('datacenter'), | ||
90 | 99 | chassis=child.get('chassis'), | ||
91 | 100 | root=child.get('root') | ||
92 | 101 | ) | ||
93 | 102 | ) | ||
94 | 103 | return crush_list | ||
95 | 104 | except ValueError as v: | ||
96 | 105 | log("Unable to parse ceph tree json: {}. Error: {}".format( | ||
97 | 106 | tree, v.message)) | ||
98 | 107 | raise | ||
99 | 108 | except subprocess.CalledProcessError as e: | ||
100 | 109 | log("ceph osd tree command failed with message: {}".format( | ||
101 | 110 | e.message)) | ||
102 | 111 | raise | ||
103 | 112 | |||
104 | 113 | |||
105 | 114 | def monitor_key_delete(key): | ||
106 | 115 | """ | ||
107 | 116 | Deletes a key value pair on the monitor cluster. | ||
108 | 117 | :param key: String. The key to delete. | ||
109 | 118 | """ | ||
110 | 119 | try: | ||
111 | 120 | subprocess.check_output( | ||
112 | 121 | ['sudo', '-u', ceph_user(), | ||
113 | 122 | 'ceph', 'config-key', 'del', str(key)]) | ||
114 | 123 | except subprocess.CalledProcessError as e: | ||
115 | 124 | log("Monitor config-key put failed with message: {}".format( | ||
116 | 125 | e.message)) | ||
117 | 126 | raise | ||
118 | 127 | |||
119 | 128 | |||
120 | 129 | def monitor_key_set(key, value): | ||
121 | 130 | """ | ||
122 | 131 | Sets a key value pair on the monitor cluster. | ||
123 | 132 | :param key: String. The key to set. | ||
124 | 133 | :param value: The value to set. This will be converted to a string | ||
125 | 134 | before setting | ||
126 | 135 | """ | ||
127 | 136 | try: | ||
128 | 137 | subprocess.check_output( | ||
129 | 138 | ['sudo', '-u', ceph_user(), | ||
130 | 139 | 'ceph', 'config-key', 'put', str(key), str(value)]) | ||
131 | 140 | except subprocess.CalledProcessError as e: | ||
132 | 141 | log("Monitor config-key put failed with message: {}".format( | ||
133 | 142 | e.message)) | ||
134 | 143 | raise | ||
135 | 144 | |||
136 | 145 | |||
137 | 146 | def monitor_key_get(key): | ||
138 | 147 | """ | ||
139 | 148 | Gets the value of an existing key in the monitor cluster. | ||
140 | 149 | :param key: String. The key to search for. | ||
141 | 150 | :return: Returns the value of that key or None if not found. | ||
142 | 151 | """ | ||
143 | 152 | try: | ||
144 | 153 | output = subprocess.check_output( | ||
145 | 154 | ['sudo', '-u', ceph_user(), | ||
146 | 155 | 'ceph', 'config-key', 'get', str(key)]) | ||
147 | 156 | return output | ||
148 | 157 | except subprocess.CalledProcessError as e: | ||
149 | 158 | log("Monitor config-key get failed with message: {}".format( | ||
150 | 159 | e.message)) | ||
151 | 160 | return None | ||
152 | 161 | |||
153 | 162 | |||
154 | 163 | def monitor_key_exists(key): | ||
155 | 164 | """ | ||
156 | 165 | Searches for the existence of a key in the monitor cluster. | ||
157 | 166 | :param key: String. The key to search for | ||
158 | 167 | :return: Returns True if the key exists, False if not and raises an | ||
159 | 168 | exception if an unknown error occurs. | ||
160 | 169 | """ | ||
161 | 170 | try: | ||
162 | 171 | subprocess.check_call( | ||
163 | 172 | ['sudo', '-u', ceph_user(), | ||
164 | 173 | 'ceph', 'config-key', 'exists', str(key)]) | ||
165 | 174 | # I can return true here regardless because Ceph returns | ||
166 | 175 | # ENOENT if the key wasn't found | ||
167 | 176 | return True | ||
168 | 177 | except subprocess.CalledProcessError as e: | ||
169 | 178 | if e.returncode == errno.ENOENT: | ||
170 | 179 | return False | ||
171 | 180 | else: | ||
172 | 181 | log("Unknown error from ceph config-get exists: {} {}".format( | ||
173 | 182 | e.returncode, e.message)) | ||
174 | 183 | raise | ||
175 | 184 | |||
176 | 185 | |||
177 | 57 | def get_version(): | 186 | def get_version(): |
178 | 58 | '''Derive Ceph release from an installed package.''' | 187 | '''Derive Ceph release from an installed package.''' |
179 | 59 | import apt_pkg as apt | 188 | import apt_pkg as apt |
180 | @@ -64,7 +193,7 @@ | |||
181 | 64 | pkg = cache[package] | 193 | pkg = cache[package] |
182 | 65 | except: | 194 | except: |
183 | 66 | # the package is unknown to the current apt cache. | 195 | # the package is unknown to the current apt cache. |
185 | 67 | e = 'Could not determine version of package with no installation '\ | 196 | e = 'Could not determine version of package with no installation ' \ |
186 | 68 | 'candidate: %s' % package | 197 | 'candidate: %s' % package |
187 | 69 | error_out(e) | 198 | error_out(e) |
188 | 70 | 199 | ||
189 | @@ -165,6 +294,7 @@ | |||
190 | 165 | # Ignore any errors for this call | 294 | # Ignore any errors for this call |
191 | 166 | subprocess.call(cmd) | 295 | subprocess.call(cmd) |
192 | 167 | 296 | ||
193 | 297 | |||
194 | 168 | DISK_FORMATS = [ | 298 | DISK_FORMATS = [ |
195 | 169 | 'xfs', | 299 | 'xfs', |
196 | 170 | 'ext4', | 300 | 'ext4', |
197 | @@ -211,6 +341,7 @@ | |||
198 | 211 | 341 | ||
199 | 212 | 342 | ||
200 | 213 | _bootstrap_keyring = "/var/lib/ceph/bootstrap-osd/ceph.keyring" | 343 | _bootstrap_keyring = "/var/lib/ceph/bootstrap-osd/ceph.keyring" |
201 | 344 | _upgrade_keyring = "/etc/ceph/ceph.client.admin.keyring" | ||
202 | 214 | 345 | ||
203 | 215 | 346 | ||
204 | 216 | def is_bootstrapped(): | 347 | def is_bootstrapped(): |
205 | @@ -236,6 +367,21 @@ | |||
206 | 236 | ] | 367 | ] |
207 | 237 | subprocess.check_call(cmd) | 368 | subprocess.check_call(cmd) |
208 | 238 | 369 | ||
209 | 370 | |||
210 | 371 | def import_osd_upgrade_key(key): | ||
211 | 372 | if not os.path.exists(_upgrade_keyring): | ||
212 | 373 | cmd = [ | ||
213 | 374 | "sudo", | ||
214 | 375 | "-u", | ||
215 | 376 | ceph_user(), | ||
216 | 377 | 'ceph-authtool', | ||
217 | 378 | _upgrade_keyring, | ||
218 | 379 | '--create-keyring', | ||
219 | 380 | '--name=client.admin', | ||
220 | 381 | '--add-key={}'.format(key) | ||
221 | 382 | ] | ||
222 | 383 | subprocess.check_call(cmd) | ||
223 | 384 | |||
224 | 239 | # OSD caps taken from ceph-create-keys | 385 | # OSD caps taken from ceph-create-keys |
225 | 240 | _osd_bootstrap_caps = { | 386 | _osd_bootstrap_caps = { |
226 | 241 | 'mon': [ | 387 | 'mon': [ |
227 | @@ -402,7 +548,7 @@ | |||
228 | 402 | 548 | ||
229 | 403 | 549 | ||
230 | 404 | def maybe_zap_journal(journal_dev): | 550 | def maybe_zap_journal(journal_dev): |
232 | 405 | if (is_osd_disk(journal_dev)): | 551 | if is_osd_disk(journal_dev): |
233 | 406 | log('Looks like {} is already an OSD data' | 552 | log('Looks like {} is already an OSD data' |
234 | 407 | ' or journal, skipping.'.format(journal_dev)) | 553 | ' or journal, skipping.'.format(journal_dev)) |
235 | 408 | return | 554 | return |
236 | @@ -445,7 +591,7 @@ | |||
237 | 445 | log('Path {} is not a block device - bailing'.format(dev)) | 591 | log('Path {} is not a block device - bailing'.format(dev)) |
238 | 446 | return | 592 | return |
239 | 447 | 593 | ||
241 | 448 | if (is_osd_disk(dev) and not reformat_osd): | 594 | if is_osd_disk(dev) and not reformat_osd: |
242 | 449 | log('Looks like {} is already an' | 595 | log('Looks like {} is already an' |
243 | 450 | ' OSD data or journal, skipping.'.format(dev)) | 596 | ' OSD data or journal, skipping.'.format(dev)) |
244 | 451 | return | 597 | return |
245 | @@ -512,7 +658,7 @@ | |||
246 | 512 | 658 | ||
247 | 513 | 659 | ||
248 | 514 | def get_running_osds(): | 660 | def get_running_osds(): |
250 | 515 | '''Returns a list of the pids of the current running OSD daemons''' | 661 | """Returns a list of the pids of the current running OSD daemons""" |
251 | 516 | cmd = ['pgrep', 'ceph-osd'] | 662 | cmd = ['pgrep', 'ceph-osd'] |
252 | 517 | try: | 663 | try: |
253 | 518 | result = subprocess.check_output(cmd) | 664 | result = subprocess.check_output(cmd) |
254 | 519 | 665 | ||
255 | === modified file 'hooks/ceph_hooks.py' | |||
256 | --- hooks/ceph_hooks.py 2016-02-25 15:48:22 +0000 | |||
257 | +++ hooks/ceph_hooks.py 2016-03-01 16:57:01 +0000 | |||
258 | @@ -8,12 +8,17 @@ | |||
259 | 8 | # | 8 | # |
260 | 9 | 9 | ||
261 | 10 | import glob | 10 | import glob |
262 | 11 | import hashlib | ||
263 | 11 | import os | 12 | import os |
264 | 13 | import random | ||
265 | 12 | import shutil | 14 | import shutil |
266 | 15 | import subprocess | ||
267 | 13 | import sys | 16 | import sys |
268 | 14 | import tempfile | 17 | import tempfile |
269 | 18 | import time | ||
270 | 15 | 19 | ||
271 | 16 | import ceph | 20 | import ceph |
272 | 21 | from charmhelpers.core import hookenv | ||
273 | 17 | from charmhelpers.core.hookenv import ( | 22 | from charmhelpers.core.hookenv import ( |
274 | 18 | log, | 23 | log, |
275 | 19 | ERROR, | 24 | ERROR, |
276 | @@ -39,13 +44,13 @@ | |||
277 | 39 | filter_installed_packages, | 44 | filter_installed_packages, |
278 | 40 | ) | 45 | ) |
279 | 41 | from charmhelpers.core.sysctl import create as create_sysctl | 46 | from charmhelpers.core.sysctl import create as create_sysctl |
280 | 47 | from charmhelpers.core import host | ||
281 | 42 | 48 | ||
282 | 43 | from utils import ( | 49 | from utils import ( |
283 | 44 | get_host_ip, | 50 | get_host_ip, |
284 | 45 | get_networks, | 51 | get_networks, |
285 | 46 | assert_charm_supports_ipv6, | 52 | assert_charm_supports_ipv6, |
288 | 47 | render_template, | 53 | render_template) |
287 | 48 | ) | ||
289 | 49 | 54 | ||
290 | 50 | from charmhelpers.contrib.openstack.alternatives import install_alternative | 55 | from charmhelpers.contrib.openstack.alternatives import install_alternative |
291 | 51 | from charmhelpers.contrib.network.ip import ( | 56 | from charmhelpers.contrib.network.ip import ( |
292 | @@ -57,6 +62,188 @@ | |||
293 | 57 | 62 | ||
294 | 58 | hooks = Hooks() | 63 | hooks = Hooks() |
295 | 59 | 64 | ||
296 | 65 | # A dict of valid ceph upgrade paths. Mapping is old -> new | ||
297 | 66 | upgrade_paths = { | ||
298 | 67 | 'cloud:trusty-juno': 'cloud:trusty-kilo', | ||
299 | 68 | 'cloud:trusty-kilo': 'cloud:trusty-liberty', | ||
300 | 69 | 'cloud:trusty-liberty': None, | ||
301 | 70 | } | ||
302 | 71 | |||
303 | 72 | |||
304 | 73 | def pretty_print_upgrade_paths(): | ||
305 | 74 | lines = [] | ||
306 | 75 | for key, value in upgrade_paths.iteritems(): | ||
307 | 76 | lines.append("{} -> {}".format(key, value)) | ||
308 | 77 | return lines | ||
309 | 78 | |||
310 | 79 | |||
311 | 80 | def check_for_upgrade(): | ||
312 | 81 | c = hookenv.config() | ||
313 | 82 | old_version = c.previous('source') | ||
314 | 83 | log('old_version: {}'.format(old_version)) | ||
315 | 84 | # Strip all whitespace | ||
316 | 85 | new_version = config('source') | ||
317 | 86 | if new_version: | ||
318 | 87 | # replace all whitespace | ||
319 | 88 | new_version = new_version.replace(' ', '') | ||
320 | 89 | log('new_version: {}'.format(new_version)) | ||
321 | 90 | |||
322 | 91 | if old_version in upgrade_paths: | ||
323 | 92 | if new_version == upgrade_paths[old_version]: | ||
324 | 93 | log("{} to {} is a valid upgrade path. Proceeding.".format( | ||
325 | 94 | old_version, new_version)) | ||
326 | 95 | roll_osd_cluster(new_version) | ||
327 | 96 | else: | ||
328 | 97 | # Log a helpful error message | ||
329 | 98 | log("Invalid upgrade path from {} to {}. " | ||
330 | 99 | "Valid paths are: {}".format(old_version, | ||
331 | 100 | new_version, | ||
332 | 101 | pretty_print_upgrade_paths())) | ||
333 | 102 | |||
334 | 103 | |||
335 | 104 | def lock_and_roll(my_hash): | ||
336 | 105 | start_timestamp = time.time() | ||
337 | 106 | |||
338 | 107 | ceph.monitor_key_set("{}_start".format(my_hash), start_timestamp) | ||
339 | 108 | log("Rolling") | ||
340 | 109 | # This should be quick | ||
341 | 110 | upgrade_osd() | ||
342 | 111 | log("Done") | ||
343 | 112 | |||
344 | 113 | stop_timestamp = time.time() | ||
345 | 114 | # Set a key to inform others I am finished | ||
346 | 115 | ceph.monitor_key_set("{}_done".format(my_hash), stop_timestamp) | ||
347 | 116 | |||
348 | 117 | |||
349 | 118 | def get_hostname(): | ||
350 | 119 | try: | ||
351 | 120 | with open('/etc/hostname', 'r') as host_file: | ||
352 | 121 | host_lines = host_file.readlines() | ||
353 | 122 | if host_lines: | ||
354 | 123 | return host_lines[0].strip() | ||
355 | 124 | except IOError: | ||
356 | 125 | raise | ||
357 | 126 | |||
358 | 127 | |||
359 | 128 | # TODO: Timeout busted nodes and keep moving forward | ||
360 | 129 | # Edge cases: | ||
361 | 130 | # 1. Previous node dies on upgrade, can we retry? | ||
362 | 131 | # 2. This assumes that the osd failure domain is not set to osd. | ||
363 | 132 | # It rolls an entire server at a time. | ||
364 | 133 | def roll_osd_cluster(new_version): | ||
365 | 134 | """ | ||
366 | 135 | This is tricky to get right so here's what we're going to do. | ||
367 | 136 | There's 2 possible cases: Either I'm first in line or not. | ||
368 | 137 | If I'm not first in line I'll wait a random time between 5-30 seconds | ||
369 | 138 | and test to see if the previous osd is upgraded yet. | ||
370 | 139 | |||
371 | 140 | TODO: If you're not in the same failure domain it's safe to upgrade | ||
372 | 141 | 1. Examine all pools and adopt the most strict failure domain policy | ||
373 | 142 | Example: Pool 1: Failure domain = rack | ||
374 | 143 | Pool 2: Failure domain = host | ||
375 | 144 | Pool 3: Failure domain = row | ||
376 | 145 | |||
377 | 146 | outcome: Failure domain = host | ||
378 | 147 | """ | ||
379 | 148 | log('roll_osd_cluster called with {}'.format(new_version)) | ||
380 | 149 | my_hostname = None | ||
381 | 150 | try: | ||
382 | 151 | my_hostname = get_hostname() | ||
383 | 152 | except IOError as err: | ||
384 | 153 | log("Failed to read /etc/hostname. Error: {}".format(err.message)) | ||
385 | 154 | status_set('blocked', 'failed to upgrade monitor') | ||
386 | 155 | |||
387 | 156 | my_hash = hashlib.sha224(my_hostname).hexdigest() | ||
388 | 157 | # A sorted list of hashed osd names | ||
389 | 158 | osd_hashed_dict = {} | ||
390 | 159 | osd_tree_list = ceph.get_osd_tree() | ||
391 | 160 | osd_hashed_list = sorted([hashlib.sha224( | ||
392 | 161 | i.name.encode('utf-8')).hexdigest() for i in osd_tree_list]) | ||
393 | 162 | # Save a hash : name mapping so we can show the user which | ||
394 | 163 | # unit name we're waiting on | ||
395 | 164 | for i in osd_tree_list: | ||
396 | 165 | osd_hashed_dict[ | ||
397 | 166 | hashlib.sha224( | ||
398 | 167 | i.name.encode('utf-8')).hexdigest() | ||
399 | 168 | ] = i.name | ||
400 | 169 | log("osd_hashed_list: {}".format(osd_hashed_list)) | ||
401 | 170 | try: | ||
402 | 171 | position = osd_hashed_list.index(my_hash) | ||
403 | 172 | log("upgrade position: {}".format(position)) | ||
404 | 173 | if position == 0: | ||
405 | 174 | # I'm first! Roll | ||
406 | 175 | # First set a key to inform others I'm about to roll | ||
407 | 176 | lock_and_roll(my_hash=my_hash) | ||
408 | 177 | else: | ||
409 | 178 | # Check if the previous node has finished | ||
410 | 179 | status_set('blocked', | ||
411 | 180 | 'Waiting on {} to finish upgrading'.format( | ||
412 | 181 | osd_hashed_dict[ | ||
413 | 182 | osd_hashed_list[position - 1]] | ||
414 | 183 | )) | ||
415 | 184 | previous_node_finished = ceph.monitor_key_exists( | ||
416 | 185 | "{}_done".format(osd_hashed_list[position - 1])) | ||
417 | 186 | |||
418 | 187 | # Block and wait on the previous nodes to finish | ||
419 | 188 | while previous_node_finished is False: | ||
420 | 189 | log("previous is not finished. Waiting") | ||
421 | 190 | # Has this node been trying to upgrade for longer than 10 minutes? | ||
422 | 191 | # If so then move on and consider that node dead. | ||
423 | 192 | |||
424 | 193 | # NOTE: This assumes the clusters clocks are somewhat accurate | ||
425 | 194 | current_timestamp = time.time() | ||
426 | 195 | previous_node_start_time = ceph.monitor_key_get( | ||
427 | 196 | "{}_start".format(osd_hashed_list[position - 1])) | ||
428 | 197 | if (current_timestamp - (10 * 60)) > previous_node_start_time: | ||
429 | 198 | # Previous node is probably dead. Lets move on | ||
430 | 199 | if previous_node_start_time is not None: | ||
431 | 200 | log("Previous node {} appears dead. {} > {} Moving on".format( | ||
432 | 201 | osd_hashed_dict[osd_hashed_list[position - 1]], | ||
433 | 202 | (current_timestamp - (10 * 60)), | ||
434 | 203 | previous_node_start_time | ||
435 | 204 | )) | ||
436 | 205 | lock_and_roll(my_hash=my_hash) | ||
437 | 206 | else: | ||
438 | 207 | # ?? | ||
439 | 208 | pass | ||
440 | 209 | else: | ||
441 | 210 | # I have to wait. Sleep a random amount of time and then | ||
442 | 211 | # check if I can lock,upgrade and roll. | ||
443 | 212 | time.sleep(random.randrange(5, 30)) | ||
444 | 213 | previous_node_finished = ceph.monitor_key_exists( | ||
445 | 214 | "{}_done".format(osd_hashed_list[position - 1])) | ||
446 | 215 | lock_and_roll(my_hash=my_hash) | ||
447 | 216 | except ValueError: | ||
448 | 217 | log("Unable to find ceph monitor hash in list") | ||
449 | 218 | status_set('blocked', 'failed to upgrade monitor') | ||
450 | 219 | |||
451 | 220 | |||
452 | 221 | def upgrade_osd(): | ||
453 | 222 | add_source(config('source'), config('key')) | ||
454 | 223 | |||
455 | 224 | current_version = ceph.get_version() | ||
456 | 225 | status_set("maintenance", "Upgrading osd") | ||
457 | 226 | log("Current ceph version is {}".format(current_version)) | ||
458 | 227 | new_version = config('release-version') | ||
459 | 228 | log("Upgrading to: {}".format(new_version)) | ||
460 | 229 | |||
461 | 230 | try: | ||
462 | 231 | add_source(config('source'), config('key')) | ||
463 | 232 | apt_update(fatal=True) | ||
464 | 233 | except subprocess.CalledProcessError as err: | ||
465 | 234 | log("Adding the ceph source failed with message: {}".format( | ||
466 | 235 | err.message)) | ||
467 | 236 | status_set("blocked", "Upgrade to {} failed".format(new_version)) | ||
468 | 237 | try: | ||
469 | 238 | host.service_stop('ceph-osd-all') | ||
470 | 239 | apt_install(packages=ceph.PACKAGES, fatal=True) | ||
471 | 240 | host.service_start('ceph-osd-all') | ||
472 | 241 | status_set("active", "") | ||
473 | 242 | except subprocess.CalledProcessError as err: | ||
474 | 243 | log("Stopping ceph and upgrading packages failed " | ||
475 | 244 | "with message: {}".format(err.message)) | ||
476 | 245 | status_set("blocked", "Upgrade to {} failed".format(new_version)) | ||
477 | 246 | |||
478 | 60 | 247 | ||
479 | 61 | def install_upstart_scripts(): | 248 | def install_upstart_scripts(): |
480 | 62 | # Only install upstart configurations for older versions | 249 | # Only install upstart configurations for older versions |
481 | @@ -113,6 +300,7 @@ | |||
482 | 113 | install_alternative('ceph.conf', '/etc/ceph/ceph.conf', | 300 | install_alternative('ceph.conf', '/etc/ceph/ceph.conf', |
483 | 114 | charm_ceph_conf, 90) | 301 | charm_ceph_conf, 90) |
484 | 115 | 302 | ||
485 | 303 | |||
486 | 116 | JOURNAL_ZAPPED = '/var/lib/ceph/journal_zapped' | 304 | JOURNAL_ZAPPED = '/var/lib/ceph/journal_zapped' |
487 | 117 | 305 | ||
488 | 118 | 306 | ||
489 | @@ -147,6 +335,9 @@ | |||
490 | 147 | 335 | ||
491 | 148 | @hooks.hook('config-changed') | 336 | @hooks.hook('config-changed') |
492 | 149 | def config_changed(): | 337 | def config_changed(): |
493 | 338 | # Check if an upgrade was requested | ||
494 | 339 | check_for_upgrade() | ||
495 | 340 | |||
496 | 150 | # Pre-flight checks | 341 | # Pre-flight checks |
497 | 151 | if config('osd-format') not in ceph.DISK_FORMATS: | 342 | if config('osd-format') not in ceph.DISK_FORMATS: |
498 | 152 | log('Invalid OSD disk format configuration specified', level=ERROR) | 343 | log('Invalid OSD disk format configuration specified', level=ERROR) |
499 | @@ -160,7 +351,7 @@ | |||
500 | 160 | create_sysctl(sysctl_dict, '/etc/sysctl.d/50-ceph-osd-charm.conf') | 351 | create_sysctl(sysctl_dict, '/etc/sysctl.d/50-ceph-osd-charm.conf') |
501 | 161 | 352 | ||
502 | 162 | e_mountpoint = config('ephemeral-unmount') | 353 | e_mountpoint = config('ephemeral-unmount') |
504 | 163 | if (e_mountpoint and ceph.filesystem_mounted(e_mountpoint)): | 354 | if e_mountpoint and ceph.filesystem_mounted(e_mountpoint): |
505 | 164 | umount(e_mountpoint) | 355 | umount(e_mountpoint) |
506 | 165 | prepare_disks_and_activate() | 356 | prepare_disks_and_activate() |
507 | 166 | 357 | ||
508 | @@ -189,8 +380,9 @@ | |||
509 | 189 | hosts = [] | 380 | hosts = [] |
510 | 190 | for relid in relation_ids('mon'): | 381 | for relid in relation_ids('mon'): |
511 | 191 | for unit in related_units(relid): | 382 | for unit in related_units(relid): |
514 | 192 | addr = relation_get('ceph-public-address', unit, relid) or \ | 383 | addr = relation_get( |
515 | 193 | get_host_ip(relation_get('private-address', unit, relid)) | 384 | 'ceph-public-address', unit, relid) or \ |
516 | 385 | get_host_ip(relation_get('private-address', unit, relid)) | ||
517 | 194 | 386 | ||
518 | 195 | if addr: | 387 | if addr: |
519 | 196 | hosts.append('{}:6789'.format(format_ipv6_addr(addr) or addr)) | 388 | hosts.append('{}:6789'.format(format_ipv6_addr(addr) or addr)) |
520 | @@ -246,10 +438,12 @@ | |||
521 | 246 | 'mon-relation-departed') | 438 | 'mon-relation-departed') |
522 | 247 | def mon_relation(): | 439 | def mon_relation(): |
523 | 248 | bootstrap_key = relation_get('osd_bootstrap_key') | 440 | bootstrap_key = relation_get('osd_bootstrap_key') |
524 | 441 | upgrade_key = relation_get('osd_upgrade_key') | ||
525 | 249 | if get_fsid() and get_auth() and bootstrap_key: | 442 | if get_fsid() and get_auth() and bootstrap_key: |
526 | 250 | log('mon has provided conf- scanning disks') | 443 | log('mon has provided conf- scanning disks') |
527 | 251 | emit_cephconf() | 444 | emit_cephconf() |
528 | 252 | ceph.import_osd_bootstrap_key(bootstrap_key) | 445 | ceph.import_osd_bootstrap_key(bootstrap_key) |
529 | 446 | ceph.import_osd_upgrade_key(upgrade_key) | ||
530 | 253 | prepare_disks_and_activate() | 447 | prepare_disks_and_activate() |
531 | 254 | else: | 448 | else: |
532 | 255 | log('mon cluster has not yet provided conf') | 449 | log('mon cluster has not yet provided conf') |
533 | 256 | 450 | ||
534 | === modified file 'hooks/utils.py' | |||
535 | --- hooks/utils.py 2016-02-18 17:10:53 +0000 | |||
536 | +++ hooks/utils.py 2016-03-01 16:57:01 +0000 | |||
537 | @@ -1,4 +1,3 @@ | |||
538 | 1 | |||
539 | 2 | # | 1 | # |
540 | 3 | # Copyright 2012 Canonical Ltd. | 2 | # Copyright 2012 Canonical Ltd. |
541 | 4 | # | 3 | # |
542 | @@ -12,8 +11,8 @@ | |||
543 | 12 | from charmhelpers.core.hookenv import ( | 11 | from charmhelpers.core.hookenv import ( |
544 | 13 | unit_get, | 12 | unit_get, |
545 | 14 | cached, | 13 | cached, |
548 | 15 | config | 14 | config, |
549 | 16 | ) | 15 | status_set) |
550 | 17 | from charmhelpers.fetch import ( | 16 | from charmhelpers.fetch import ( |
551 | 18 | apt_install, | 17 | apt_install, |
552 | 19 | filter_installed_packages | 18 | filter_installed_packages |
553 | @@ -87,6 +86,35 @@ | |||
554 | 87 | return answers[0].address | 86 | return answers[0].address |
555 | 88 | 87 | ||
556 | 89 | 88 | ||
557 | 89 | def get_public_addr(): | ||
558 | 90 | return get_network_addrs('ceph-public-network')[0] | ||
559 | 91 | |||
560 | 92 | |||
561 | 93 | def get_network_addrs(config_opt): | ||
562 | 94 | """Get all configured public networks addresses. | ||
563 | 95 | |||
564 | 96 | If public network(s) are provided, go through them and return the | ||
565 | 97 | addresses we have configured on any of those networks. | ||
566 | 98 | """ | ||
567 | 99 | addrs = [] | ||
568 | 100 | networks = config(config_opt) | ||
569 | 101 | if networks: | ||
570 | 102 | networks = networks.split() | ||
571 | 103 | addrs = [get_address_in_network(n) for n in networks] | ||
572 | 104 | addrs = [a for a in addrs if a] | ||
573 | 105 | |||
574 | 106 | if not addrs: | ||
575 | 107 | if networks: | ||
576 | 108 | msg = ("Could not find an address on any of '%s' - resolve this " | ||
577 | 109 | "error to retry" % networks) | ||
578 | 110 | status_set('blocked', msg) | ||
579 | 111 | raise Exception(msg) | ||
580 | 112 | else: | ||
581 | 113 | return [get_host_ip()] | ||
582 | 114 | |||
583 | 115 | return addrs | ||
584 | 116 | |||
585 | 117 | |||
586 | 90 | def get_networks(config_opt='ceph-public-network'): | 118 | def get_networks(config_opt='ceph-public-network'): |
587 | 91 | """Get all configured networks from provided config option. | 119 | """Get all configured networks from provided config option. |
588 | 92 | 120 | ||
589 | 93 | 121 | ||
590 | === modified file 'templates/ceph.conf' | |||
591 | --- templates/ceph.conf 2016-01-18 16:42:36 +0000 | |||
592 | +++ templates/ceph.conf 2016-03-01 16:57:01 +0000 | |||
593 | @@ -6,7 +6,7 @@ | |||
594 | 6 | auth service required = {{ auth_supported }} | 6 | auth service required = {{ auth_supported }} |
595 | 7 | auth client required = {{ auth_supported }} | 7 | auth client required = {{ auth_supported }} |
596 | 8 | {% endif %} | 8 | {% endif %} |
598 | 9 | keyring = /etc/ceph/$cluster.$name.keyring | 9 | keyring = /etc/ceph/ceph.client.admin.keyring |
599 | 10 | mon host = {{ mon_hosts }} | 10 | mon host = {{ mon_hosts }} |
600 | 11 | fsid = {{ fsid }} | 11 | fsid = {{ fsid }} |
601 | 12 | 12 |
Note: I put the helpers up for review on charmhelpers: https://code.launchpad.net/~xfactor973/charm-helpers/ceph-keystore/+merge/287205
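For reference, the wait-and-timeout behaviour that `roll_osd_cluster` implements in the preview diff (block on the previous unit's `_done` key, but move on if its `_start` timestamp is more than ten minutes old) can be sketched against a plain dict standing in for the monitors' config-key store. The function name and the dict are illustrative, not the charm's API; they only model the decision logic.

```python
TIMEOUT = 10 * 60  # seconds; mirrors the diff's ten-minute heuristic

def must_wait(store, prev_hash, now):
    """True while this unit must keep waiting on the previous one."""
    if '{}_done'.format(prev_hash) in store:
        return False  # previous unit finished; safe to roll
    start = store.get('{}_start'.format(prev_hash))
    if start is not None and now - TIMEOUT > start:
        return False  # previous unit presumed dead; move on anyway
    return True       # still upgrading (or not started); keep waiting

# Previous unit started 100s ago and has not finished: keep waiting.
store = {'abc_start': 1000.0}
print(must_wait(store, 'abc', 1100.0))  # True
# More than 10 minutes have passed since it started: give up on it.
print(must_wait(store, 'abc', 1601.0))  # False
```

Note that the diff compares this timestamp against the raw string returned by `monitor_key_get`, which is one of the points a reviewer would likely want tightened; the sketch above assumes the stored value has already been converted back to a float.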