Merge ~mgerdts/cloud-init:lp1667735 into cloud-init:master

Proposed by Mike Gerdts
Status: Merged
Approved by: Scott Moser
Approved revision: 076968c6efc9759d196e604df24aa1bd1701b9be
Merge reported by: Scott Moser
Merged at revision: 4c573d0e0173d2b1e99a383c54a0a6c957aa1cbb
Proposed branch: ~mgerdts/cloud-init:lp1667735
Merge into: cloud-init:master
Diff against target: 353 lines (+204/-16)
2 files modified
cloudinit/sources/DataSourceSmartOS.py (+102/-15)
tests/unittests/test_datasource/test_smartos.py (+102/-1)
Reviewer Review Type Date Requested Status
Server Team CI bot continuous-integration Approve
Scott Moser Approve
Review via email: mp+343118@code.launchpad.net

Commit message

DataSourceSmartOS: fix hang when metadata service is down

If the metadata service in the host is down while a guest that uses
DataSourceSmartOS is booting, the request from the guest falls into the
bit bucket. When the metadata service is eventually started, the guest
has no awareness of this and does not resend the request. This results in
cloud-init hanging forever with a guest reboot as the only recovery
option.

This fix updates the metadata protocol to implement the initialization
phase, just as is implemented by mdata-get and related utilities. The
initialization phase includes draining all pending data from the serial
port, writing an empty command and getting an expected error message in
reply. If the initialization phase times out, it is retried every five
seconds. Each timeout results in a warning message: "Timeout while
initializing metadata client. Is the host metadata service running?" By
default, warning messages are logged to the console, thus the reason for a
hung boot is readily apparent.

LP: #1667735

To post a comment you must log in.
Revision history for this message
Scott Moser (smoser) wrote :

I like it.
Good work.
I've pointed the c-i bot at it, so it will comment soon.

https://jenkins.ubuntu.com/server/job/cloud-init-ci/1005/

there are 2 tiny things inline.

Revision history for this message
Server Team CI bot (server-team-bot) wrote :

FAILED: Continuous integration, rev:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1005/
Executed test runs:
    SUCCESS: Checkout
    FAILED: Unit & Style Tests

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1005/rebuild

review: Needs Fixing (continuous-integration)
~mgerdts/cloud-init:lp1667735 updated
62cafc5... by Mike Gerdts

smoser and CI feedback

Revision history for this message
Mike Gerdts (mgerdts) wrote :
Download full text (6.1 KiB)

On Thu, Apr 12, 2018 at 8:34 PM, Scott Moser <email address hidden>
wrote:

> I like it.
> Good work.
> I've pointed the c-i bot at it, so it will comment soon.
>
> https://jenkins.ubuntu.com/server/job/cloud-init-ci/1005/
>
> there are 2 tiny things inline.
>

Thanks for taking a look. I address your comments below and will also
address an issue found by the CI bot.

>
>
> Diff comments:
>
> > diff --git a/cloudinit/sources/DataSourceSmartOS.py b/cloudinit/sources/
> DataSourceSmartOS.py
> > index 86bfa5d..5717dae 100644
> > --- a/cloudinit/sources/DataSourceSmartOS.py
> > +++ b/cloudinit/sources/DataSourceSmartOS.py
> > @@ -360,6 +371,45 @@ class JoyentMetadataClient(object):
> > LOG.debug('Value "%s" found.', value)
> > return value
> >
> > + def _readline(self):
> > + """
> > + Reads a line a byte at a time until \n is encountered.
> Returns an
> > + ascii string with the trailing newline removed.
> > +
> > + If a timeout (per-byte) is set and it expires, a
> > + JoyentMetadataFetchException will be thrown.
> > + """
> > + response = bytearray()
> > + while True:
> > + try:
> > + byte = self.fp.read(1)
> > + if len(byte) == 0:
> > + raise JoyentMetadataTimeoutException(
> > + "Partial response: '%s'" %
> response.decode('ascii'))
> > + if ord(byte) == ord(b'\n'):
>
> is ord needed here?
> we read a byte. can't we just compare it to b'\n' directly?
>

When reading from a serial port, that works fine. I've been unable to
figure out the magic to put bytes of the same kind on response_parts array.

With this:

diff --git a/cloudinit/sources/DataSourceSmartOS.py
b/cloudinit/sources/DataSourceSmartOS.py
index 5717dae..f368e22 100644
--- a/cloudinit/sources/DataSourceSmartOS.py
+++ b/cloudinit/sources/DataSourceSmartOS.py
@@ -383,12 +383,14 @@ class JoyentMetadataClient(object):
         while True:
             try:
                 byte = self.fp.read(1)
+ LOG.debug('read byte %r len %d newline %s', byte,
len(byte),
+ byte == b'\n')
                 if len(byte) == 0:
                     raise JoyentMetadataTimeoutException(
                         "Partial response: '%s'" %
response.decode('ascii'))
- if ord(byte) == ord(b'\n'):
+ if byte == b'\n':
                     return response.decode('ascii')
- response.extend([ord(byte)])
+ response.append(ord(byte))
             except OSError as exc:
                 if exc.errno == errno.EAGAIN:
                     raise JoyentMetadataTimeoutException(

I get:

======================================================================
ERROR: test_negotiate
(tests.unittests.test_datasource.test_smartos.TestJoyentMetadataClient)
----------------------------------------------------------------------
Traceback (most recent call last):
  File
"/home/mgerdts/cloud-init/tests/unittests/test_datasource/test_smartos.py",
line 761, in test_negotiate
    client._negotiate()
  File "/home/mgerdts/cloud-init/cloudinit/sour...

Read more...

~mgerdts/cloud-init:lp1667735 updated
5a78cc5... by Mike Gerdts

blist from smoser

2e968f1... by Mike Gerdts

smoser fixes #2

Revision history for this message
Scott Moser (smoser) wrote :

tiny nit.

Revision history for this message
Scott Moser (smoser) wrote :

https://jenkins.ubuntu.com/server/job/cloud-init-ci/1009/console

I approve of this branch, either fix the nit in my last comment or justify it.

review: Approve
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1009/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    SUCCESS: MAAS Compatability Testing
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1009/rebuild

review: Approve (continuous-integration)
~mgerdts/cloud-init:lp1667735 updated
076968c... by Mike Gerdts

smoser final feedback: redundant flush

Revision history for this message
Mike Gerdts (mgerdts) wrote :

Yes, that was redundant. Fixed & retested (tox + real system). Fixed in
changeset:

commit 076968c6efc9759d196e604df24aa1bd1701b9be
Author: Mike Gerdts <email address hidden>
Date: Mon Apr 16 22:39:35 2018 -0500

    smoser final feedback: redundant flush

Thanks much for the review, help, and approval.

Mike

On Mon, Apr 16, 2018 at 7:50 PM, Scott Moser <email address hidden>
wrote:

> Review: Approve
>
> https://jenkins.ubuntu.com/server/job/cloud-init-ci/1009/console
>
> I approve of this branch, either fix the nit in my last comment or justify
> it.
>
>
> --
> https://code.launchpad.net/~mgerdts/cloud-init/+git/cloud-
> init/+merge/343118
> You are the owner of ~mgerdts/cloud-init:lp1667735.
>

Revision history for this message
Scott Moser (smoser) :
review: Approve
Revision history for this message
Server Team CI bot (server-team-bot) wrote :

PASSED: Continuous integration, rev:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1017/
Executed test runs:
    SUCCESS: Checkout
    SUCCESS: Unit & Style Tests
    SUCCESS: Ubuntu LTS: Build
    SUCCESS: Ubuntu LTS: Integration
    SUCCESS: MAAS Compatability Testing
    IN_PROGRESS: Declarative: Post Actions

Click here to trigger a rebuild:
https://jenkins.ubuntu.com/server/job/cloud-init-ci/1017/rebuild

review: Approve (continuous-integration)
Revision history for this message
Scott Moser (smoser) wrote :

An upstream commit landed for this bug.

To view that commit see the following URL:
https://git.launchpad.net/cloud-init/commit/?id=4c573d0e

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1diff --git a/cloudinit/sources/DataSourceSmartOS.py b/cloudinit/sources/DataSourceSmartOS.py
2index 86bfa5d..a904808 100644
3--- a/cloudinit/sources/DataSourceSmartOS.py
4+++ b/cloudinit/sources/DataSourceSmartOS.py
5@@ -1,4 +1,5 @@
6 # Copyright (C) 2013 Canonical Ltd.
7+# Copyright (c) 2018, Joyent, Inc.
8 #
9 # Author: Ben Howard <ben.howard@canonical.com>
10 #
11@@ -21,6 +22,7 @@
12
13 import base64
14 import binascii
15+import errno
16 import json
17 import os
18 import random
19@@ -229,6 +231,9 @@ class DataSourceSmartOS(sources.DataSource):
20 self.md_client)
21 return False
22
23+ # Open once for many requests, rather than once for each request
24+ self.md_client.open_transport()
25+
26 for ci_noun, attribute in SMARTOS_ATTRIB_MAP.items():
27 smartos_noun, strip = attribute
28 md[ci_noun] = self.md_client.get(smartos_noun, strip=strip)
29@@ -236,6 +241,8 @@ class DataSourceSmartOS(sources.DataSource):
30 for ci_noun, smartos_noun in SMARTOS_ATTRIB_JSON.items():
31 md[ci_noun] = self.md_client.get_json(smartos_noun)
32
33+ self.md_client.close_transport()
34+
35 # @datadictionary: This key may contain a program that is written
36 # to a file in the filesystem of the guest on each boot and then
37 # executed. It may be of any format that would be considered
38@@ -316,6 +323,10 @@ class JoyentMetadataFetchException(Exception):
39 pass
40
41
42+class JoyentMetadataTimeoutException(JoyentMetadataFetchException):
43+ pass
44+
45+
46 class JoyentMetadataClient(object):
47 """
48 A client implementing v2 of the Joyent Metadata Protocol Specification.
49@@ -360,6 +371,47 @@ class JoyentMetadataClient(object):
50 LOG.debug('Value "%s" found.', value)
51 return value
52
53+ def _readline(self):
54+ """
55+ Reads a line a byte at a time until \n is encountered. Returns an
56+ ascii string with the trailing newline removed.
57+
58+ If a timeout (per-byte) is set and it expires, a
59+ JoyentMetadataFetchException will be thrown.
60+ """
61+ response = []
62+
63+ def as_ascii():
64+ return b''.join(response).decode('ascii')
65+
66+ msg = "Partial response: '%s'"
67+ while True:
68+ try:
69+ byte = self.fp.read(1)
70+ if len(byte) == 0:
71+ raise JoyentMetadataTimeoutException(msg % as_ascii())
72+ if byte == b'\n':
73+ return as_ascii()
74+ response.append(byte)
75+ except OSError as exc:
76+ if exc.errno == errno.EAGAIN:
77+ raise JoyentMetadataTimeoutException(msg % as_ascii())
78+ raise
79+
80+ def _write(self, msg):
81+ self.fp.write(msg.encode('ascii'))
82+ self.fp.flush()
83+
84+ def _negotiate(self):
85+ LOG.debug('Negotiating protocol V2')
86+ self._write('NEGOTIATE V2\n')
87+ response = self._readline()
88+ LOG.debug('read "%s"', response)
89+ if response != 'V2_OK':
90+ raise JoyentMetadataFetchException(
91+ 'Invalid response "%s" to "NEGOTIATE V2"' % response)
92+ LOG.debug('Negotiation complete')
93+
94 def request(self, rtype, param=None):
95 request_id = '{0:08x}'.format(random.randint(0, 0xffffffff))
96 message_body = ' '.join((request_id, rtype,))
97@@ -374,18 +426,11 @@ class JoyentMetadataClient(object):
98 self.open_transport()
99 need_close = True
100
101- self.fp.write(msg.encode('ascii'))
102- self.fp.flush()
103-
104- response = bytearray()
105- response.extend(self.fp.read(1))
106- while response[-1:] != b'\n':
107- response.extend(self.fp.read(1))
108-
109+ self._write(msg)
110+ response = self._readline()
111 if need_close:
112 self.close_transport()
113
114- response = response.rstrip().decode('ascii')
115 LOG.debug('Read "%s" from metadata transport.', response)
116
117 if 'SUCCESS' not in response:
118@@ -450,6 +495,7 @@ class JoyentMetadataSocketClient(JoyentMetadataClient):
119 sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
120 sock.connect(self.socketpath)
121 self.fp = sock.makefile('rwb')
122+ self._negotiate()
123
124 def exists(self):
125 return os.path.exists(self.socketpath)
126@@ -459,8 +505,9 @@ class JoyentMetadataSocketClient(JoyentMetadataClient):
127
128
129 class JoyentMetadataSerialClient(JoyentMetadataClient):
130- def __init__(self, device, timeout=10, smartos_type=SMARTOS_ENV_KVM):
131- super(JoyentMetadataSerialClient, self).__init__(smartos_type)
132+ def __init__(self, device, timeout=10, smartos_type=SMARTOS_ENV_KVM,
133+ fp=None):
134+ super(JoyentMetadataSerialClient, self).__init__(smartos_type, fp)
135 self.device = device
136 self.timeout = timeout
137
138@@ -468,10 +515,50 @@ class JoyentMetadataSerialClient(JoyentMetadataClient):
139 return os.path.exists(self.device)
140
141 def open_transport(self):
142- ser = serial.Serial(self.device, timeout=self.timeout)
143- if not ser.isOpen():
144- raise SystemError("Unable to open %s" % self.device)
145- self.fp = ser
146+ if self.fp is None:
147+ ser = serial.Serial(self.device, timeout=self.timeout)
148+ if not ser.isOpen():
149+ raise SystemError("Unable to open %s" % self.device)
150+ self.fp = ser
151+ self._flush()
152+ self._negotiate()
153+
154+ def _flush(self):
155+ LOG.debug('Flushing input')
156+ # Read any pending data
157+ timeout = self.fp.timeout
158+ self.fp.timeout = 0.1
159+ while True:
160+ try:
161+ self._readline()
162+ except JoyentMetadataTimeoutException:
163+ break
164+ LOG.debug('Input empty')
165+
166+ # Send a newline and expect "invalid command". Keep trying until
167+ # successful. Retry rather frequently so that the "Is the host
168+ # metadata service running" appears on the console soon after someone
169+ # attaches in an effort to debug.
170+ if timeout > 5:
171+ self.fp.timeout = 5
172+ else:
173+ self.fp.timeout = timeout
174+ while True:
175+ LOG.debug('Writing newline, expecting "invalid command"')
176+ self._write('\n')
177+ try:
178+ response = self._readline()
179+ if response == 'invalid command':
180+ break
181+ if response == 'FAILURE':
182+ LOG.debug('Got "FAILURE". Retrying.')
183+ continue
184+ LOG.warning('Unexpected response "%s" during flush', response)
185+ except JoyentMetadataTimeoutException:
186+ LOG.warning('Timeout while initializing metadata client. ' +
187+ 'Is the host metadata service running?')
188+ LOG.debug('Got "invalid command". Flush complete.')
189+ self.fp.timeout = timeout
190
191 def __repr__(self):
192 return "%s(device=%s, timeout=%s)" % (
193diff --git a/tests/unittests/test_datasource/test_smartos.py b/tests/unittests/test_datasource/test_smartos.py
194index 88bae5f..2bea7a1 100644
195--- a/tests/unittests/test_datasource/test_smartos.py
196+++ b/tests/unittests/test_datasource/test_smartos.py
197@@ -1,4 +1,5 @@
198 # Copyright (C) 2013 Canonical Ltd.
199+# Copyright (c) 2018, Joyent, Inc.
200 #
201 # Author: Ben Howard <ben.howard@canonical.com>
202 #
203@@ -324,6 +325,7 @@ class PsuedoJoyentClient(object):
204 if data is None:
205 data = MOCK_RETURNS.copy()
206 self.data = data
207+ self._is_open = False
208 return
209
210 def get(self, key, default=None, strip=False):
211@@ -344,6 +346,14 @@ class PsuedoJoyentClient(object):
212 def exists(self):
213 return True
214
215+ def open_transport(self):
216+ assert(not self._is_open)
217+ self._is_open = True
218+
219+ def close_transport(self):
220+ assert(self._is_open)
221+ self._is_open = False
222+
223
224 class TestSmartOSDataSource(FilesystemMockingTestCase):
225 def setUp(self):
226@@ -592,8 +602,46 @@ class TestSmartOSDataSource(FilesystemMockingTestCase):
227 mydscfg['disk_aliases']['FOO'])
228
229
230+class ShortReader(object):
231+ """Implements a 'read' interface for bytes provided.
232+ much like io.BytesIO but the 'endbyte' acts as if EOF.
233+ When it is reached a short will be returned."""
234+ def __init__(self, initial_bytes, endbyte=b'\0'):
235+ self.data = initial_bytes
236+ self.index = 0
237+ self.len = len(self.data)
238+ self.endbyte = endbyte
239+
240+ @property
241+ def emptied(self):
242+ return self.index >= self.len
243+
244+ def read(self, size=-1):
245+ """Read size bytes but not past a null."""
246+ if size == 0 or self.index >= self.len:
247+ return b''
248+
249+ rsize = size
250+ if size < 0 or size + self.index > self.len:
251+ rsize = self.len - self.index
252+
253+ next_null = self.data.find(self.endbyte, self.index, rsize)
254+ if next_null >= 0:
255+ rsize = next_null - self.index + 1
256+ i = self.index
257+ self.index += rsize
258+ ret = self.data[i:i + rsize]
259+ if len(ret) and ret[-1:] == self.endbyte:
260+ ret = ret[:-1]
261+ return ret
262+
263+
264 class TestJoyentMetadataClient(FilesystemMockingTestCase):
265
266+ invalid = b'invalid command\n'
267+ failure = b'FAILURE\n'
268+ v2_ok = b'V2_OK\n'
269+
270 def setUp(self):
271 super(TestJoyentMetadataClient, self).setUp()
272
273@@ -636,6 +684,11 @@ class TestJoyentMetadataClient(FilesystemMockingTestCase):
274 return DataSourceSmartOS.JoyentMetadataClient(
275 fp=self.serial, smartos_type=DataSourceSmartOS.SMARTOS_ENV_KVM)
276
277+ def _get_serial_client(self):
278+ self.serial.timeout = 1
279+ return DataSourceSmartOS.JoyentMetadataSerialClient(None,
280+ fp=self.serial)
281+
282 def assertEndsWith(self, haystack, prefix):
283 self.assertTrue(haystack.endswith(prefix),
284 "{0} does not end with '{1}'".format(
285@@ -646,12 +699,14 @@ class TestJoyentMetadataClient(FilesystemMockingTestCase):
286 "{0} does not start with '{1}'".format(
287 repr(haystack), prefix))
288
289+ def assertNoMoreSideEffects(self, obj):
290+ self.assertRaises(StopIteration, obj)
291+
292 def test_get_metadata_writes_a_single_line(self):
293 client = self._get_client()
294 client.get('some_key')
295 self.assertEqual(1, self.serial.write.call_count)
296 written_line = self.serial.write.call_args[0][0]
297- print(type(written_line))
298 self.assertEndsWith(written_line.decode('ascii'),
299 b'\n'.decode('ascii'))
300 self.assertEqual(1, written_line.count(b'\n'))
301@@ -737,6 +792,52 @@ class TestJoyentMetadataClient(FilesystemMockingTestCase):
302 client._checksum = lambda _: self.response_parts['crc']
303 self.assertIsNone(client.get('some_key'))
304
305+ def test_negotiate(self):
306+ client = self._get_client()
307+ reader = ShortReader(self.v2_ok)
308+ client.fp.read.side_effect = reader.read
309+ client._negotiate()
310+ self.assertTrue(reader.emptied)
311+
312+ def test_negotiate_short_response(self):
313+ client = self._get_client()
314+ # chopped '\n' from v2_ok.
315+ reader = ShortReader(self.v2_ok[:-1] + b'\0')
316+ client.fp.read.side_effect = reader.read
317+ self.assertRaises(DataSourceSmartOS.JoyentMetadataTimeoutException,
318+ client._negotiate)
319+ self.assertTrue(reader.emptied)
320+
321+ def test_negotiate_bad_response(self):
322+ client = self._get_client()
323+ reader = ShortReader(b'garbage\n' + self.v2_ok)
324+ client.fp.read.side_effect = reader.read
325+ self.assertRaises(DataSourceSmartOS.JoyentMetadataFetchException,
326+ client._negotiate)
327+ self.assertEqual(self.v2_ok, client.fp.read())
328+
329+ def test_serial_open_transport(self):
330+ client = self._get_serial_client()
331+ reader = ShortReader(b'garbage\0' + self.invalid + self.v2_ok)
332+ client.fp.read.side_effect = reader.read
333+ client.open_transport()
334+ self.assertTrue(reader.emptied)
335+
336+ def test_flush_failure(self):
337+ client = self._get_serial_client()
338+ reader = ShortReader(b'garbage' + b'\0' + self.failure +
339+ self.invalid + self.v2_ok)
340+ client.fp.read.side_effect = reader.read
341+ client.open_transport()
342+ self.assertTrue(reader.emptied)
343+
344+ def test_flush_many_timeouts(self):
345+ client = self._get_serial_client()
346+ reader = ShortReader(b'\0' * 100 + self.invalid + self.v2_ok)
347+ client.fp.read.side_effect = reader.read
348+ client.open_transport()
349+ self.assertTrue(reader.emptied)
350+
351
352 class TestNetworkConversion(TestCase):
353 def test_convert_simple(self):

Subscribers

People subscribed via source and target branches