Merge lp:~doanac/utah/uuid_check_rsyslog into lp:utah

Proposed by Andy Doan
Status: Merged
Approved by: Javier Collado
Approved revision: 892
Merged at revision: 913
Proposed branch: lp:~doanac/utah/uuid_check_rsyslog
Merge into: lp:utah
Diff against target: 350 lines (+89/-37) (has conflicts)
8 files modified
templates/install-commands.jinja2 (+23/-0)
tests/test_rsyslog.py (+34/-11)
utah/config.py (+4/-7)
utah/provisioning/baremetal/bamboofeeder.py (+1/-1)
utah/provisioning/baremetal/cobbler.py (+1/-1)
utah/provisioning/provisioning.py (+0/-12)
utah/provisioning/rsyslog.py (+25/-4)
utah/provisioning/vm.py (+1/-1)
Text conflict in utah/provisioning/rsyslog.py
To merge this branch: bzr merge lp:~doanac/utah/uuid_check_rsyslog
Reviewer Review Type Date Requested Status
Andy Doan (community) Needs Resubmitting
Review via email: mp+162423@code.launchpad.net

Description of the change

get the uuid from rsyslog at system boot rather than using ssh.

To post a comment you must log in.
Revision history for this message
Max Brustkern (nuclearbob) wrote :

I've read that $() is better than `` for portability, but we do appear to be using `` in other spots in the code, so it's obviously not breaking anything right now.

I have two more substantive concerns. Are we going to receive a UUID on systems where the wait_for_booted isn't working? Is it safe to change the incorrect uuid behavior from an Exception to logging an error? The intent of this check is to ensure we're actually running on the system we installed, and I don't think logging an error and then continuing the test is necessarily an appropriate response. We could consider doing a retry, but I think continuing is a problem.

Revision history for this message
Javier Collado (javier.collado) wrote :

Line 228 in the diff will conflict with:
https://code.launchpad.net/~javier.collado/utah/bug1176076/+merge/162382

In this merge proposal 'match.string' is used while 'pattern' is used in mine.

Any of them is probably fine, but I think 'pattern' is slightly better because the error message says "matched future pattern" instead of something like "matched text". Anyway, I think that change should be removed from this merge request since it's already included in the other one. Do you agree on this?

Aside from this, it isn't completely clear to me what the 'uuid: True' line in the configuration file does. Maybe and explanation about it should be added to the docstring in wait_for_install where an example of valid steps is provided.

Revision history for this message
Andy Doan (doanac) wrote :

On 05/03/2013 10:29 AM, Max Brustkern wrote:

> I have two more substantive concerns. Are we going to receive a UUID
> on systems where the wait_for_booted isn't working?

no

> Is it safe to
> change the incorrect uuid behavior from an Exception to logging an
> error? The intent of this check is to ensure we're actually running
> on the system we installed, and I don't think logging an error and
> then continuing the test is necessarily an appropriate response. We
> could consider doing a retry, but I think continuing is a problem.

so - I'm probably downplaying the importance of the UUID here.
Originally I liked this better but this is probably worth debate.

The *real* reason I shuffled this around is because this need to make an
SSH connection to the target after install gets in the way of doing an
autostart of UTAH from rc.local. So if we make a retry type thing here,
its still going to be a problem for my next planned patchset.

Revision history for this message
Andy Doan (doanac) wrote :

On 05/03/2013 10:30 AM, Javier Collado wrote:
> Any of them is probably fine, but I think 'pattern' is slightly better because the error message says "matched future pattern" instead of something like "matched text". Anyway, I think that change should be removed from this merge request since it's already included in the other one. Do you agree on this?

your way is better.

> Aside from this, it isn't completely clear to me what the 'uuid: True' line in the configuration file does. Maybe and explanation about it should be added to the docstring in wait_for_install where an example of valid steps is provided.

+1

lp:~doanac/utah/uuid_check_rsyslog updated
888. By Andy Doan

as per max's review

this raises an exception when UUID mismatches since its a provisioning
issue.

889. By Andy Doan

add a comment about uuid conf as per javier

890. By Andy Doan

improve timeout logic with regards to future_patterns

The future pattern logic works correctly. However, it hinges on the
pattern being hit within the limits of the previous timeout interval.
This change adds up the future timeouts and makes us wait that whole
duration so we can properly detect future messages. This happens most
often on the wait_for_booted logic.

891. By Andy Doan

wait for network before running install commands

in the case of this running during the installer, nothing will change.
in the case of booting an image and running from rc.local, we should
be safer now and always get the rsyslog messages.

This seems to fix the issues we had with wait_for_booted not always
working.

Revision history for this message
Andy Doan (doanac) wrote :

I think this fixes the main issues.

verified with saucy server/desktop

revno 891 - is pretty big fix. This seems to ensure rc.local's rsyslog messages will make it to UTAH.
revno 890 - makes our wait_for timeout smarter by taking into account future pattern timeout values.

review: Needs Resubmitting
lp:~doanac/utah/uuid_check_rsyslog updated
892. By Andy Doan

restart rsyslog when network comes alive

I've seen an issue mostly with server images and rsyslog. The issue
is that rsyslog starts running before the network is up. It seems
to detect this and at some point stop trying to send things to the
remote syslog server (or at least waits a bit before trying again).
This is causing our first few syslog messages to not always show up,
and these messages are critical for UTAH to understand how things
are working.

This restarts rsyslogd after we know the network is up so that we
can reliably get these messages.

Revision history for this message
Andy Doan (doanac) wrote :

This branch is stable now and has been tested most of the week. revno 892 seems to be the magic commit that makes things reliable.

review: Needs Resubmitting
Revision history for this message
Max Brustkern (nuclearbob) wrote :

If rsyslog is now reliable enough for the uuid check, we should remove the workaround to log a warning instead of timing out during wait_for_booted. I think in the current code, a timeout would issue a warning and then possibly cause the system to proceed with no uuid check occurring.

Revision history for this message
Javier Collado (javier.collado) wrote :

I've tested with a desktop i386 image and the pass runlist and it worked fine.
However, I have the feeling that there's some duplication in the code that is a
little bit confusing. In particular, I've seen that the machine UUID is written
twice to the syslog file.

The reason for this is that the commands in `install-commands.jinja2` are
written both to the late command script and also to `/etc/rc.local`. Given that
UUID is expected to be captured during the boot sequence, I believe it should
only be in `/etc/rc.local`, not in the late command script.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

Maybe we should move the install commands into rc.local only? I think we've been discussing relying less on the preseed for a while.

Revision history for this message
Andy Doan (doanac) wrote :

On 05/16/2013 10:35 AM, Max Brustkern wrote:
> If rsyslog is now reliable enough for the uuid check, we should remove the workaround to log a warning instead of timing out during wait_for_booted. I think in the current code, a timeout would issue a warning and then possibly cause the system to proceed with no uuid check occurring.

I *think* it probably is, but I was kind of wanting to see this in wild
first ie in 0.13 and then make that an error for 0.14 just to be safe.

Revision history for this message
Andy Doan (doanac) wrote :

On 05/16/2013 11:15 AM, Javier Collado wrote:
> I've tested with a desktop i386 image and the pass runlist and it worked fine.
> However, I have the feeling that there's some duplication in the code that is a
> little bit confusing. In particular, I've seen that the machine UUID is written
> twice to the syslog file.
>
> The reason for this is that the commands in `install-commands.jinja2` are
> written both to the late command script and also to `/etc/rc.local`. Given that
> UUID is expected to be captured during the boot sequence, I believe it should
> only be in `/etc/rc.local`, not in the late command script.

I think sending it twice is harmless. I have another branch that's
dependent on this that does some more refactoring around this area that
might be a better place to address.

On 05/16/2013 01:33 PM, Max Brustkern wrote:
 > Maybe we should move the install commands into rc.local only? I
 > think we've been discussing relying less on the preseed for a while.

I think this is what we should shoot for. I'm not totally sure we should
do it in this change though?

Revision history for this message
Max Brustkern (nuclearbob) wrote :

> On 05/16/2013 11:15 AM, Javier Collado wrote:
> > I've tested with a desktop i386 image and the pass runlist and it worked
> fine.
> > However, I have the feeling that there's some duplication in the code that
> is a
> > little bit confusing. In particular, I've seen that the machine UUID is
> written
> > twice to the syslog file.
> >
> > The reason for this is that the commands in `install-commands.jinja2` are
> > written both to the late command script and also to `/etc/rc.local`. Given
> that
> > UUID is expected to be captured during the boot sequence, I believe it
> should
> > only be in `/etc/rc.local`, not in the late command script.
>

I think you may be right about that being out of scope here, but I'm also worried about a temporary situation that's less reliable than where we were before and where we're going. Am I crazy for imagining a situation where some rogue preseed fakes us into thinking that the install is done in the latecommand, and then we check the uuid, declare it valid, and then reboot into a system that hasn't installed correctly? That seems pretty unlikely I guess.
> I think sending it twice is harmless. I have another branch that's
> dependent on this that does some more refactoring around this area that
> might be a better place to address.
>
> On 05/16/2013 01:33 PM, Max Brustkern wrote:
> > Maybe we should move the install commands into rc.local only? I
> > think we've been discussing relying less on the preseed for a while.
>
> I think this is what we should shoot for. I'm not totally sure we should
> do it in this change though?

Revision history for this message
Andy Doan (doanac) wrote :

On 05/16/2013 03:03 PM, Max Brustkern wrote:
> I think you may be right about that being out of scope here, but I'm
> also worried about a temporary situation that's less reliable than
> where we were before and where we're going.

sounds good

> Am I crazy for imagining
> a situation where some rogue preseed fakes us into thinking that the
> install is done in the latecommand, and then we check the uuid,
> declare it valid, and then reboot into a system that hasn't installed
> correctly?

I don't think this is possible. We only check for UUID in the
boot_steps, not the install_steps. So UUID is basically ignored if
printed during the install.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

> > Am I crazy for imagining
> > a situation where some rogue preseed fakes us into thinking that the
> > install is done in the latecommand, and then we check the uuid,
> > declare it valid, and then reboot into a system that hasn't installed
> > correctly?
>
> I don't think this is possible. We only check for UUID in the
> boot_steps, not the install_steps. So UUID is basically ignored if
> printed during the install.

That's what I'm trying to figure out: is there any way we could think the install is done before it is?

Revision history for this message
Andy Doan (doanac) wrote :

On 05/16/2013 03:17 PM, Max Brustkern wrote:
> That's what I'm trying to figure out: is there any way we could think the install is done before it is?

I don't see how.

Revision history for this message
Javier Collado (javier.collado) wrote :

> I don't see how.

I think I agree on this because of the separation of the install and boot
steps. Otherwise, it would be possible to match a pattern in the future for the
first uuid that is printed and consider that the boot up is done.

Revision history for this message
Max Brustkern (nuclearbob) wrote :

Okay, we should be in good shape to merge, then.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'templates/install-commands.jinja2'
2--- templates/install-commands.jinja2 2013-02-27 11:10:59 +0000
3+++ templates/install-commands.jinja2 2013-05-10 15:36:27 +0000
4@@ -7,6 +7,24 @@
5 {{ check_locks_command() | indent }}
6 }
7
8+wait_for_net()
9+{
10+ log "waiting for network"
11+ timeout 2m sh <<EOF
12+ gw=""
13+ while [ -z "\$gw" ] ; do
14+ sleep 2
15+ gw=\`ip route | grep default\`
16+ done
17+
18+ # rsyslog may start skipping remote logging if it got too many
19+ # messages to send before the network was up. This makes it
20+ # start afresh and deliver our messages.
21+ restart rsyslog
22+ sleep 1
23+EOF
24+}
25+
26 install_package() {
27 package=$1
28 check_locks
29@@ -22,5 +40,10 @@
30 done
31 }
32
33+wait_for_net
34 install_packages {{packages | join(' ')}}
35+
36+# NOTE: do syslog at the end, so we do not try and start connecting via
37+# ssh or running other stuff before the target is ready
38+log "Machine UUID: `cat /etc/utah/uuid`"
39 {%- endmacro %}
40
41=== modified file 'tests/test_rsyslog.py'
42--- tests/test_rsyslog.py 2013-05-01 20:51:33 +0000
43+++ tests/test_rsyslog.py 2013-05-10 15:36:27 +0000
44@@ -18,15 +18,18 @@
45
46 import itertools
47 import logging
48+import mock
49 import socket
50 import time
51 import tempfile
52 import threading
53 import unittest
54+import uuid
55
56 from mock import patch
57
58 from utah.exceptions import UTAHException
59+from utah.provisioning.exceptions import UTAHProvisioningException
60 from utah.provisioning.rsyslog import RSyslog
61 from utah.timeout import UTAHTimeout
62
63@@ -109,7 +112,9 @@
64 {
65 "message": "test_future",
66 "pattern": ".*log source = /proc/kmsg started",
67- "timeout": 120,
68+ # set a really small timeout to prove we properly calculate
69+ # future timeout logic
70+ "timeout": 0.01,
71 "booted": True,
72 },
73 {
74@@ -194,24 +199,39 @@
75 self.assertTrue(self.test_callbacks_booted, 'booted callback not made')
76 self.assertTrue(self.test_callbacks_blah, 'blah callback not made')
77
78- def test_booted(self):
79+ @mock.patch('test_rsyslog.TestRsyslog.logger')
80+ def test_booted(self, logger):
81 """Make sure wait_for_booted works."""
82+ tst_uuid = str(uuid.uuid4())
83 steps = [
84 {
85 "message": "test booted",
86 "pattern": ".*log source = /proc/kmsg started",
87 "timeout": 120,
88 },
89+ {
90+ "message": "test uuid",
91+ "uuid": True,
92+ "pattern": '.*Machine UUID:\s+(.*)',
93+ "timeout": 2,
94+ },
95 ]
96
97 messages = [
98 'test 1',
99 '<13>Feb 20 13:38:55 host user: log source = /proc/kmsg started',
100+ '<13>Feb 21 13:40:55 host user: Machine UUID: {}'.format(tst_uuid),
101 ]
102
103 r = RSyslog("utah-test", '/tmp')
104- threading.Thread(target=self.producer, args=(r.port, messages)).start()
105- r.wait_for_booted(steps)
106+ r.logger = logger
107+ threading.Thread(target=self.producer, args=(r.port, messages)).start()
108+ r.wait_for_booted(steps, tst_uuid)
109+ self.assertFalse(logger.error.called)
110+
111+ threading.Thread(target=self.producer, args=(r.port, messages)).start()
112+ with self.assertRaises(UTAHProvisioningException):
113+ r.wait_for_booted(steps, 'bad')
114
115 def test_booted_workaround(self):
116 """Test that wait_for_booted doesn't raise an exception on timeout."""
117@@ -225,7 +245,7 @@
118
119 r = RSyslog("utah-test", '/tmp')
120
121- r.wait_for_booted(steps)
122+ r.wait_for_booted(steps, None)
123
124 def usefile(self, truncate):
125 """Make sure wait_for_booted works when tailing a file."""
126@@ -246,10 +266,12 @@
127
128 with tempfile.NamedTemporaryFile() as f:
129 r = RSyslog("utah-test", '/tmp', f.name)
130- threading.Thread(
131+ t = threading.Thread(
132 target=self.file_producer,
133- args=(f, messages, truncate)).start()
134- r.wait_for_booted(steps)
135+ args=(f, messages, truncate))
136+ t.start()
137+ r.wait_for_booted(steps, None)
138+ t.join()
139
140 def test_usefile(self):
141 """Test tailing a file."""
142@@ -270,18 +292,19 @@
143 {
144 "message": "message_2",
145 "pattern": "pattern",
146+ "timeout": 1,
147 },
148 {
149 "message": "message_3",
150 "pattern": "pattern",
151+ "timeout": 1,
152 },
153 ]
154
155 messages = itertools.repeat('log example')
156- pattern = (r'Timeout \({}\) occurred for {} message.\n'
157+ pattern = (r'Timeout \(\d+\) occurred for {} message.\n'
158 r'Remaining messages: {}'
159- .format(steps[0]['timeout'],
160- steps[0]['message'],
161+ .format(steps[0]['message'],
162 ', '.join([step['message']
163 for step in steps[1:]])))
164
165
166=== modified file 'utah/config.py'
167--- utah/config.py 2013-05-03 23:07:50 +0000
168+++ utah/config.py 2013-05-10 15:36:27 +0000
169@@ -228,14 +228,11 @@
170 'timeout': 180,
171 },
172 {
173- 'message': 'network active',
174- 'pattern': [
175- '.*NetworkManager',
176- '.*dhclient: bound to', # needed for minimal server installs
177- ],
178-
179+ 'message': 'UTAH uuid received',
180+ 'uuid': True, # binds this to the callback for wait_for_booted
181+ 'pattern': '.*Machine UUID:\s+(.*)',
182 'timeout': 180,
183- }
184+ },
185 ],
186 )
187
188
189=== modified file 'utah/provisioning/baremetal/bamboofeeder.py'
190--- utah/provisioning/baremetal/bamboofeeder.py 2013-04-26 13:36:08 +0000
191+++ utah/provisioning/baremetal/bamboofeeder.py 2013-05-10 15:36:27 +0000
192@@ -276,7 +276,7 @@
193 self.logger.info('System installing')
194
195 self.rsyslog.wait_for_install(config.install_steps)
196- self.rsyslog.wait_for_booted(config.boot_steps)
197+ self.rsyslog.wait_for_booted(config.boot_steps, self.uuid)
198 retry(self.sshcheck, logmethod=self.logger.info,
199 retry_timeout=config.checktimeout)
200
201
202=== modified file 'utah/provisioning/baremetal/cobbler.py'
203--- utah/provisioning/baremetal/cobbler.py 2013-04-26 13:36:08 +0000
204+++ utah/provisioning/baremetal/cobbler.py 2013-05-10 15:36:27 +0000
205@@ -206,7 +206,7 @@
206 # ps-unity-autopilot-release-testing/140/label=autopilot-nvidia/
207 # For now, it seems to be a problem on all installs, so we're going to
208 # handle it in rsyslog.py
209- self.rsyslog.wait_for_booted(config.boot_steps)
210+ self.rsyslog.wait_for_booted(config.boot_steps, self.uuid)
211
212 retry(self.sshcheck, logmethod=self.logger.info,
213 retry_timeout=config.checktimeout)
214
215=== modified file 'utah/provisioning/provisioning.py'
216--- utah/provisioning/provisioning.py 2013-05-03 23:08:37 +0000
217+++ utah/provisioning/provisioning.py 2013-05-10 15:36:27 +0000
218@@ -398,7 +398,6 @@
219 self.logger.debug('Failed to load machine: %s', str(err))
220 self._create()
221 self.provisioned = True
222- self._uuid_check()
223
224 def _create(self):
225 # TODO: discuss separation of this vs. start when separating Install
226@@ -590,17 +589,6 @@
227 if self.clean:
228 cleanup.add_command(cmd)
229
230- def _uuid_check(self):
231- """Check uuid to verify we've installed correctly."""
232- uuid_check_command = ('[ "{uuid}" == "$(cat /etc/utah/uuid)" ]'
233- .format(uuid=self.uuid))
234- if self.run(uuid_check_command)[0] != 0:
235- self.provisioned = False
236- raise UTAHProvisioningException(
237- 'Installed UUID differs from Machine UUID; '
238- 'installation probably failed. '
239- 'Try rerunning install.')
240-
241
242 class CustomInstallMixin(object):
243
244
245=== modified file 'utah/provisioning/rsyslog.py'
246--- utah/provisioning/rsyslog.py 2013-05-03 15:58:24 +0000
247+++ utah/provisioning/rsyslog.py 2013-05-10 15:36:27 +0000
248@@ -24,6 +24,7 @@
249 import time
250
251 from utah.exceptions import UTAHException
252+from utah.provisioning.exceptions import UTAHProvisioningException
253 from utah.timeout import UTAHTimeout
254
255
256@@ -113,7 +114,7 @@
257 callbacks = {'booted': booted_callback}
258 self._wait_for_steps(steps, logfile, callbacks)
259
260- def wait_for_booted(self, steps):
261+ def wait_for_booted(self, steps, uuid):
262 """Monitor rsyslog during boot up.
263
264 Works the same as the :meth:`wait_for_install` method but takes in
265@@ -122,16 +123,29 @@
266
267 :param steps: Set of steps that the boot up will go through.
268 :type steps: list(dict)
269+ :param uuid: The UUID generated by provisioning code.
270+ :type uuid: string
271
272 .. seealso:: :meth:`wait_for_install`
273
274 """
275 logfile = '{}/{}-boot.log'.format(self._logpath, self._host)
276- callbacks = {}
277+
278+ def _cb(match):
279+ if match:
280+ _cb.uuid = match.group(1).strip()
281+
282+ _cb.uuid = None
283+ callbacks = {'uuid': _cb}
284 # TODO: Resolve the issues around this and get it back to normal
285 # i.e. LP#1100386
286 try:
287 self._wait_for_steps(steps, logfile, callbacks)
288+ if not _cb.uuid:
289+ self.logger.error('UUID not sent from target')
290+ elif uuid and uuid != _cb.uuid:
291+ msg = 'Incorrect UUID: {} != {}'.format(uuid, _cb.uuid)
292+ raise UTAHProvisioningException(msg)
293 except UTAHTimeout:
294 self.logger.warning('Timed out waiting for boot, but continuing')
295
296@@ -149,9 +163,10 @@
297 if not isinstance(fail_pattern, list):
298 fail_pattern = [fail_pattern]
299 pattern.extend(fail_pattern)
300- future_pats = self._future_patterns(steps, x)
301+ future_pats, future_timeout = self._future_patterns(steps, x)
302 pattern.extend(future_pats)
303 self.logger.info('Waiting %ds for: %s', timeout, message)
304+ timeout += future_timeout
305 pattern, match = self._wait_for(f, pattern, message, timeout)
306 if match is None:
307 remaining_messages = [step['message']
308@@ -170,7 +185,11 @@
309 .format(pattern))
310 if pattern in future_pats:
311 msg = 'Expected pattern missed, matched future pattern: %s'
312+<<<<<<< TREE
313 self.logger.warn(msg, pattern)
314+=======
315+ self.logger.warn(msg, match.string)
316+>>>>>>> MERGE-SOURCE
317 x = self._fast_forward(steps, pattern, match, callbacks)
318 else:
319 self.logger.info('Matched pattern %r for %r message',
320@@ -188,14 +207,16 @@
321 @staticmethod
322 def _future_patterns(steps, index):
323 patterns = []
324+ timeout = 0
325 index += 1
326 for step in steps[index:]:
327 pattern = step['pattern']
328+ timeout += step['timeout']
329 if not isinstance(pattern, list):
330 patterns.append(pattern)
331 else:
332 patterns.extend(pattern)
333- return patterns
334+ return patterns, timeout
335
336 @staticmethod
337 def _fast_forward(steps, pattern, match, callbacks):
338
339=== modified file 'utah/provisioning/vm.py'
340--- utah/provisioning/vm.py 2013-05-09 19:51:42 +0000
341+++ utah/provisioning/vm.py 2013-05-10 15:36:27 +0000
342@@ -546,7 +546,7 @@
343 else:
344 raise UTAHVMProvisioningException('Failed to provision VM')
345
346- self.rsyslog.wait_for_booted(config.boot_steps)
347+ self.rsyslog.wait_for_booted(config.boot_steps, self.uuid)
348 try:
349 self.logger.info(
350 'Waiting %d seconds for ping response', self.boottimeout)

Subscribers

People subscribed via source and target branches

to all changes: