Merge ~mgerdts/cloud-init:lp1667735 into cloud-init:master
- Git
- lp:~mgerdts/cloud-init
- lp1667735
- Merge into master
Status: | Merged |
---|---|
Approved by: | Scott Moser |
Approved revision: | 076968c6efc9759d196e604df24aa1bd1701b9be |
Merge reported by: | Scott Moser |
Merged at revision: | 4c573d0e0173d2b1e99a383c54a0a6c957aa1cbb |
Proposed branch: | ~mgerdts/cloud-init:lp1667735 |
Merge into: | cloud-init:master |
Diff against target: |
353 lines (+204/-16) 2 files modified
cloudinit/sources/DataSourceSmartOS.py (+102/-15) tests/unittests/test_datasource/test_smartos.py (+102/-1) |
Related bugs: |
Reviewer | Review Type | Date Requested | Status |
---|---|---|---|
Server Team CI bot | continuous-integration | Approve | |
Scott Moser | Approve | ||
Review via email:
|
Commit message
DataSourceSmartOS: fix hang when metadata service is down
If the metadata service in the host is down while a guest that uses
DataSourceSmartOS is booting, the request from the guest falls into the
bit bucket. When the metadata service is eventually started, the guest
has no awareness of this and does not resend the request. This results in
cloud-init hanging forever with a guest reboot as the only recovery
option.
This fix updates the metadata protocol to implement the initialization
phase, just as is implemented by mdata-get and related utilities. The
initialization phase includes draining all pending data from the serial
port, writing an empty command and getting an expected error message in
reply. If the initialization phase times out, it is retried every five
seconds. Each timeout results in a warning message: "Timeout while
initializing metadata client. Is the host metadata service running?" By
default, warning messages are logged to the console, thus the reason for a
hung boot is readily apparent.
LP: #1667735
Description of the change

Scott Moser (smoser) wrote : | # |

Server Team CI bot (server-team-bot) wrote : | # |
FAILED: Continuous integration, rev:
https:/
Executed test runs:
SUCCESS: Checkout
FAILED: Unit & Style Tests
Click here to trigger a rebuild:
https:/
- 62cafc5... by Mike Gerdts
-
smoser and CI feedback

Mike Gerdts (mgerdts) wrote : | # |
On Thu, Apr 12, 2018 at 8:34 PM, Scott Moser <email address hidden>
wrote:
> I like it.
> Good work.
> I've pointed the c-i bot at it, so it will comment soon.
>
> https:/
>
> there are 2 tiny things inline.
>
Thanks for taking a look. I address your comments below and will also
address an issue found by the CI bot.
>
>
> Diff comments:
>
> > diff --git a/cloudinit/
> DataSourceSmart
> > index 86bfa5d..5717dae 100644
> > --- a/cloudinit/
> > +++ b/cloudinit/
> > @@ -360,6 +371,45 @@ class JoyentMetadataC
> > LOG.debug('Value "%s" found.', value)
> > return value
> >
> > + def _readline(self):
> > + """
> > + Reads a line a byte at a time until \n is encountered.
> Returns an
> > + ascii string with the trailing newline removed.
> > +
> > + If a timeout (per-byte) is set and it expires, a
> > + JoyentMetadataF
> > + """
> > + response = bytearray()
> > + while True:
> > + try:
> > + byte = self.fp.read(1)
> > + if len(byte) == 0:
> > + raise JoyentMetadataT
> > + "Partial response: '%s'" %
> response.
> > + if ord(byte) == ord(b'\n'):
>
> is ord needed here?
> we read a byte. can't we just compare it to b'\n' directly?
>
When reading from a serial port, that works fine. I've been unable to
figure out the magic to put bytes of the same kind on response_parts array.
With this:
diff --git a/cloudinit/
b/cloudinit/
index 5717dae..f368e22 100644
--- a/cloudinit/
+++ b/cloudinit/
@@ -383,12 +383,14 @@ class JoyentMetadataC
while True:
try:
+ LOG.debug('read byte %r len %d newline %s', byte,
len(byte),
+ byte == b'\n')
if len(byte) == 0:
response.
- if ord(byte) == ord(b'\n'):
+ if byte == b'\n':
- response.
+ response.
except OSError as exc:
if exc.errno == errno.EAGAIN:
I get:
=======
ERROR: test_negotiate
(tests.
-------
Traceback (most recent call last):
File
"/home/
line 761, in test_negotiate
client.
File "/home/
- 5a78cc5... by Mike Gerdts
-
blist from smoser
- 2e968f1... by Mike Gerdts
-
smoser fixes #2

Scott Moser (smoser) wrote : | # |
tiny nit.

Scott Moser (smoser) wrote : | # |
https:/
I approve of this branch, either fix the nit in my last comment or justify it.

Server Team CI bot (server-team-bot) wrote : | # |
PASSED: Continuous integration, rev:
https:/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
SUCCESS: Ubuntu LTS: Integration
SUCCESS: MAAS Compatability Testing
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https:/
- 076968c... by Mike Gerdts
-
smoser final feedback: redundant flush

Mike Gerdts (mgerdts) wrote : | # |
Yes, that was redundant. Fixed & retested (tox + real system). Fixed in
changeset:
commit 076968c6efc9759
Author: Mike Gerdts <email address hidden>
Date: Mon Apr 16 22:39:35 2018 -0500
smoser final feedback: redundant flush
Thanks much for the review, help, and approval.
Mike
On Mon, Apr 16, 2018 at 7:50 PM, Scott Moser <email address hidden>
wrote:
> Review: Approve
>
> https:/
>
> I approve of this branch, either fix the nit in my last comment or justify
> it.
>
>
> --
> https:/
> init/+merge/343118
> You are the owner of ~mgerdts/
>

Scott Moser (smoser) : | # |

Server Team CI bot (server-team-bot) wrote : | # |
PASSED: Continuous integration, rev:
https:/
Executed test runs:
SUCCESS: Checkout
SUCCESS: Unit & Style Tests
SUCCESS: Ubuntu LTS: Build
SUCCESS: Ubuntu LTS: Integration
SUCCESS: MAAS Compatability Testing
IN_PROGRESS: Declarative: Post Actions
Click here to trigger a rebuild:
https:/

Scott Moser (smoser) wrote : | # |
An upstream commit landed for this bug.
To view that commit see the following URL:
https:/
Preview Diff
1 | diff --git a/cloudinit/sources/DataSourceSmartOS.py b/cloudinit/sources/DataSourceSmartOS.py |
2 | index 86bfa5d..a904808 100644 |
3 | --- a/cloudinit/sources/DataSourceSmartOS.py |
4 | +++ b/cloudinit/sources/DataSourceSmartOS.py |
5 | @@ -1,4 +1,5 @@ |
6 | # Copyright (C) 2013 Canonical Ltd. |
7 | +# Copyright (c) 2018, Joyent, Inc. |
8 | # |
9 | # Author: Ben Howard <ben.howard@canonical.com> |
10 | # |
11 | @@ -21,6 +22,7 @@ |
12 | |
13 | import base64 |
14 | import binascii |
15 | +import errno |
16 | import json |
17 | import os |
18 | import random |
19 | @@ -229,6 +231,9 @@ class DataSourceSmartOS(sources.DataSource): |
20 | self.md_client) |
21 | return False |
22 | |
23 | + # Open once for many requests, rather than once for each request |
24 | + self.md_client.open_transport() |
25 | + |
26 | for ci_noun, attribute in SMARTOS_ATTRIB_MAP.items(): |
27 | smartos_noun, strip = attribute |
28 | md[ci_noun] = self.md_client.get(smartos_noun, strip=strip) |
29 | @@ -236,6 +241,8 @@ class DataSourceSmartOS(sources.DataSource): |
30 | for ci_noun, smartos_noun in SMARTOS_ATTRIB_JSON.items(): |
31 | md[ci_noun] = self.md_client.get_json(smartos_noun) |
32 | |
33 | + self.md_client.close_transport() |
34 | + |
35 | # @datadictionary: This key may contain a program that is written |
36 | # to a file in the filesystem of the guest on each boot and then |
37 | # executed. It may be of any format that would be considered |
38 | @@ -316,6 +323,10 @@ class JoyentMetadataFetchException(Exception): |
39 | pass |
40 | |
41 | |
42 | +class JoyentMetadataTimeoutException(JoyentMetadataFetchException): |
43 | + pass |
44 | + |
45 | + |
46 | class JoyentMetadataClient(object): |
47 | """ |
48 | A client implementing v2 of the Joyent Metadata Protocol Specification. |
49 | @@ -360,6 +371,47 @@ class JoyentMetadataClient(object): |
50 | LOG.debug('Value "%s" found.', value) |
51 | return value |
52 | |
53 | + def _readline(self): |
54 | + """ |
55 | + Reads a line a byte at a time until \n is encountered. Returns an |
56 | + ascii string with the trailing newline removed. |
57 | + |
58 | + If a timeout (per-byte) is set and it expires, a |
59 | + JoyentMetadataFetchException will be thrown. |
60 | + """ |
61 | + response = [] |
62 | + |
63 | + def as_ascii(): |
64 | + return b''.join(response).decode('ascii') |
65 | + |
66 | + msg = "Partial response: '%s'" |
67 | + while True: |
68 | + try: |
69 | + byte = self.fp.read(1) |
70 | + if len(byte) == 0: |
71 | + raise JoyentMetadataTimeoutException(msg % as_ascii()) |
72 | + if byte == b'\n': |
73 | + return as_ascii() |
74 | + response.append(byte) |
75 | + except OSError as exc: |
76 | + if exc.errno == errno.EAGAIN: |
77 | + raise JoyentMetadataTimeoutException(msg % as_ascii()) |
78 | + raise |
79 | + |
80 | + def _write(self, msg): |
81 | + self.fp.write(msg.encode('ascii')) |
82 | + self.fp.flush() |
83 | + |
84 | + def _negotiate(self): |
85 | + LOG.debug('Negotiating protocol V2') |
86 | + self._write('NEGOTIATE V2\n') |
87 | + response = self._readline() |
88 | + LOG.debug('read "%s"', response) |
89 | + if response != 'V2_OK': |
90 | + raise JoyentMetadataFetchException( |
91 | + 'Invalid response "%s" to "NEGOTIATE V2"' % response) |
92 | + LOG.debug('Negotiation complete') |
93 | + |
94 | def request(self, rtype, param=None): |
95 | request_id = '{0:08x}'.format(random.randint(0, 0xffffffff)) |
96 | message_body = ' '.join((request_id, rtype,)) |
97 | @@ -374,18 +426,11 @@ class JoyentMetadataClient(object): |
98 | self.open_transport() |
99 | need_close = True |
100 | |
101 | - self.fp.write(msg.encode('ascii')) |
102 | - self.fp.flush() |
103 | - |
104 | - response = bytearray() |
105 | - response.extend(self.fp.read(1)) |
106 | - while response[-1:] != b'\n': |
107 | - response.extend(self.fp.read(1)) |
108 | - |
109 | + self._write(msg) |
110 | + response = self._readline() |
111 | if need_close: |
112 | self.close_transport() |
113 | |
114 | - response = response.rstrip().decode('ascii') |
115 | LOG.debug('Read "%s" from metadata transport.', response) |
116 | |
117 | if 'SUCCESS' not in response: |
118 | @@ -450,6 +495,7 @@ class JoyentMetadataSocketClient(JoyentMetadataClient): |
119 | sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) |
120 | sock.connect(self.socketpath) |
121 | self.fp = sock.makefile('rwb') |
122 | + self._negotiate() |
123 | |
124 | def exists(self): |
125 | return os.path.exists(self.socketpath) |
126 | @@ -459,8 +505,9 @@ class JoyentMetadataSocketClient(JoyentMetadataClient): |
127 | |
128 | |
129 | class JoyentMetadataSerialClient(JoyentMetadataClient): |
130 | - def __init__(self, device, timeout=10, smartos_type=SMARTOS_ENV_KVM): |
131 | - super(JoyentMetadataSerialClient, self).__init__(smartos_type) |
132 | + def __init__(self, device, timeout=10, smartos_type=SMARTOS_ENV_KVM, |
133 | + fp=None): |
134 | + super(JoyentMetadataSerialClient, self).__init__(smartos_type, fp) |
135 | self.device = device |
136 | self.timeout = timeout |
137 | |
138 | @@ -468,10 +515,50 @@ class JoyentMetadataSerialClient(JoyentMetadataClient): |
139 | return os.path.exists(self.device) |
140 | |
141 | def open_transport(self): |
142 | - ser = serial.Serial(self.device, timeout=self.timeout) |
143 | - if not ser.isOpen(): |
144 | - raise SystemError("Unable to open %s" % self.device) |
145 | - self.fp = ser |
146 | + if self.fp is None: |
147 | + ser = serial.Serial(self.device, timeout=self.timeout) |
148 | + if not ser.isOpen(): |
149 | + raise SystemError("Unable to open %s" % self.device) |
150 | + self.fp = ser |
151 | + self._flush() |
152 | + self._negotiate() |
153 | + |
154 | + def _flush(self): |
155 | + LOG.debug('Flushing input') |
156 | + # Read any pending data |
157 | + timeout = self.fp.timeout |
158 | + self.fp.timeout = 0.1 |
159 | + while True: |
160 | + try: |
161 | + self._readline() |
162 | + except JoyentMetadataTimeoutException: |
163 | + break |
164 | + LOG.debug('Input empty') |
165 | + |
166 | + # Send a newline and expect "invalid command". Keep trying until |
167 | + # successful. Retry rather frequently so that the "Is the host |
168 | + # metadata service running" appears on the console soon after someone |
169 | + # attaches in an effort to debug. |
170 | + if timeout > 5: |
171 | + self.fp.timeout = 5 |
172 | + else: |
173 | + self.fp.timeout = timeout |
174 | + while True: |
175 | + LOG.debug('Writing newline, expecting "invalid command"') |
176 | + self._write('\n') |
177 | + try: |
178 | + response = self._readline() |
179 | + if response == 'invalid command': |
180 | + break |
181 | + if response == 'FAILURE': |
182 | + LOG.debug('Got "FAILURE". Retrying.') |
183 | + continue |
184 | + LOG.warning('Unexpected response "%s" during flush', response) |
185 | + except JoyentMetadataTimeoutException: |
186 | + LOG.warning('Timeout while initializing metadata client. ' + |
187 | + 'Is the host metadata service running?') |
188 | + LOG.debug('Got "invalid command". Flush complete.') |
189 | + self.fp.timeout = timeout |
190 | |
191 | def __repr__(self): |
192 | return "%s(device=%s, timeout=%s)" % ( |
193 | diff --git a/tests/unittests/test_datasource/test_smartos.py b/tests/unittests/test_datasource/test_smartos.py |
194 | index 88bae5f..2bea7a1 100644 |
195 | --- a/tests/unittests/test_datasource/test_smartos.py |
196 | +++ b/tests/unittests/test_datasource/test_smartos.py |
197 | @@ -1,4 +1,5 @@ |
198 | # Copyright (C) 2013 Canonical Ltd. |
199 | +# Copyright (c) 2018, Joyent, Inc. |
200 | # |
201 | # Author: Ben Howard <ben.howard@canonical.com> |
202 | # |
203 | @@ -324,6 +325,7 @@ class PsuedoJoyentClient(object): |
204 | if data is None: |
205 | data = MOCK_RETURNS.copy() |
206 | self.data = data |
207 | + self._is_open = False |
208 | return |
209 | |
210 | def get(self, key, default=None, strip=False): |
211 | @@ -344,6 +346,14 @@ class PsuedoJoyentClient(object): |
212 | def exists(self): |
213 | return True |
214 | |
215 | + def open_transport(self): |
216 | + assert(not self._is_open) |
217 | + self._is_open = True |
218 | + |
219 | + def close_transport(self): |
220 | + assert(self._is_open) |
221 | + self._is_open = False |
222 | + |
223 | |
224 | class TestSmartOSDataSource(FilesystemMockingTestCase): |
225 | def setUp(self): |
226 | @@ -592,8 +602,46 @@ class TestSmartOSDataSource(FilesystemMockingTestCase): |
227 | mydscfg['disk_aliases']['FOO']) |
228 | |
229 | |
230 | +class ShortReader(object): |
231 | + """Implements a 'read' interface for bytes provided. |
232 | + much like io.BytesIO but the 'endbyte' acts as if EOF. |
233 | + When it is reached a short will be returned.""" |
234 | + def __init__(self, initial_bytes, endbyte=b'\0'): |
235 | + self.data = initial_bytes |
236 | + self.index = 0 |
237 | + self.len = len(self.data) |
238 | + self.endbyte = endbyte |
239 | + |
240 | + @property |
241 | + def emptied(self): |
242 | + return self.index >= self.len |
243 | + |
244 | + def read(self, size=-1): |
245 | + """Read size bytes but not past a null.""" |
246 | + if size == 0 or self.index >= self.len: |
247 | + return b'' |
248 | + |
249 | + rsize = size |
250 | + if size < 0 or size + self.index > self.len: |
251 | + rsize = self.len - self.index |
252 | + |
253 | + next_null = self.data.find(self.endbyte, self.index, rsize) |
254 | + if next_null >= 0: |
255 | + rsize = next_null - self.index + 1 |
256 | + i = self.index |
257 | + self.index += rsize |
258 | + ret = self.data[i:i + rsize] |
259 | + if len(ret) and ret[-1:] == self.endbyte: |
260 | + ret = ret[:-1] |
261 | + return ret |
262 | + |
263 | + |
264 | class TestJoyentMetadataClient(FilesystemMockingTestCase): |
265 | |
266 | + invalid = b'invalid command\n' |
267 | + failure = b'FAILURE\n' |
268 | + v2_ok = b'V2_OK\n' |
269 | + |
270 | def setUp(self): |
271 | super(TestJoyentMetadataClient, self).setUp() |
272 | |
273 | @@ -636,6 +684,11 @@ class TestJoyentMetadataClient(FilesystemMockingTestCase): |
274 | return DataSourceSmartOS.JoyentMetadataClient( |
275 | fp=self.serial, smartos_type=DataSourceSmartOS.SMARTOS_ENV_KVM) |
276 | |
277 | + def _get_serial_client(self): |
278 | + self.serial.timeout = 1 |
279 | + return DataSourceSmartOS.JoyentMetadataSerialClient(None, |
280 | + fp=self.serial) |
281 | + |
282 | def assertEndsWith(self, haystack, prefix): |
283 | self.assertTrue(haystack.endswith(prefix), |
284 | "{0} does not end with '{1}'".format( |
285 | @@ -646,12 +699,14 @@ class TestJoyentMetadataClient(FilesystemMockingTestCase): |
286 | "{0} does not start with '{1}'".format( |
287 | repr(haystack), prefix)) |
288 | |
289 | + def assertNoMoreSideEffects(self, obj): |
290 | + self.assertRaises(StopIteration, obj) |
291 | + |
292 | def test_get_metadata_writes_a_single_line(self): |
293 | client = self._get_client() |
294 | client.get('some_key') |
295 | self.assertEqual(1, self.serial.write.call_count) |
296 | written_line = self.serial.write.call_args[0][0] |
297 | - print(type(written_line)) |
298 | self.assertEndsWith(written_line.decode('ascii'), |
299 | b'\n'.decode('ascii')) |
300 | self.assertEqual(1, written_line.count(b'\n')) |
301 | @@ -737,6 +792,52 @@ class TestJoyentMetadataClient(FilesystemMockingTestCase): |
302 | client._checksum = lambda _: self.response_parts['crc'] |
303 | self.assertIsNone(client.get('some_key')) |
304 | |
305 | + def test_negotiate(self): |
306 | + client = self._get_client() |
307 | + reader = ShortReader(self.v2_ok) |
308 | + client.fp.read.side_effect = reader.read |
309 | + client._negotiate() |
310 | + self.assertTrue(reader.emptied) |
311 | + |
312 | + def test_negotiate_short_response(self): |
313 | + client = self._get_client() |
314 | + # chopped '\n' from v2_ok. |
315 | + reader = ShortReader(self.v2_ok[:-1] + b'\0') |
316 | + client.fp.read.side_effect = reader.read |
317 | + self.assertRaises(DataSourceSmartOS.JoyentMetadataTimeoutException, |
318 | + client._negotiate) |
319 | + self.assertTrue(reader.emptied) |
320 | + |
321 | + def test_negotiate_bad_response(self): |
322 | + client = self._get_client() |
323 | + reader = ShortReader(b'garbage\n' + self.v2_ok) |
324 | + client.fp.read.side_effect = reader.read |
325 | + self.assertRaises(DataSourceSmartOS.JoyentMetadataFetchException, |
326 | + client._negotiate) |
327 | + self.assertEqual(self.v2_ok, client.fp.read()) |
328 | + |
329 | + def test_serial_open_transport(self): |
330 | + client = self._get_serial_client() |
331 | + reader = ShortReader(b'garbage\0' + self.invalid + self.v2_ok) |
332 | + client.fp.read.side_effect = reader.read |
333 | + client.open_transport() |
334 | + self.assertTrue(reader.emptied) |
335 | + |
336 | + def test_flush_failure(self): |
337 | + client = self._get_serial_client() |
338 | + reader = ShortReader(b'garbage' + b'\0' + self.failure + |
339 | + self.invalid + self.v2_ok) |
340 | + client.fp.read.side_effect = reader.read |
341 | + client.open_transport() |
342 | + self.assertTrue(reader.emptied) |
343 | + |
344 | + def test_flush_many_timeouts(self): |
345 | + client = self._get_serial_client() |
346 | + reader = ShortReader(b'\0' * 100 + self.invalid + self.v2_ok) |
347 | + client.fp.read.side_effect = reader.read |
348 | + client.open_transport() |
349 | + self.assertTrue(reader.emptied) |
350 | + |
351 | |
352 | class TestNetworkConversion(TestCase): |
353 | def test_convert_simple(self): |
I like it.
Good work.
I've pointed the c-i bot at it, so it will comment soon.
https:/ /jenkins. ubuntu. com/server/ job/cloud- init-ci/ 1005/
there are 2 tiny things inline.