Merge lp:~1chb1n/charm-helpers/amulet-svc-restart-race into lp:charm-helpers

Proposed by Ryan Beisner
Status: Merged
Merged at revision: 444
Proposed branch: lp:~1chb1n/charm-helpers/amulet-svc-restart-race
Merge into: lp:charm-helpers
Diff against target: 241 lines (+101/-53)
1 file modified
charmhelpers/contrib/amulet/utils.py (+101/-53)
To merge this branch: bzr merge lp:~1chb1n/charm-helpers/amulet-svc-restart-race
Reviewer Review Type Date Requested Status
Billy Olsen Approve
Adam Collard (community) Abstain
charmers Pending
Review via email: mp+269098@code.launchpad.net

Description of the change

Add exception handling to the retry logic in service_restarted_since and update _get_proc_start_time to use existing PID helper, addressing bug 1474030.

Add deprecation warn re: service_restarted amulet helper, in favor of validate_service_changed_config.

FYI - The following charm MPs for updated tests are dependent on this C-H change landing first:

https://code.launchpad.net/~1chb1n/charms/trusty/keystone/amulet-svc-check-race-fix/+merge/269501

https://code.launchpad.net/~1chb1n/charms/trusty/neutron-api/next-amulet-update/+merge/263594

https://code.launchpad.net/~1chb1n/charms/trusty/openstack-dashboard/next-amulet-fixup-1507/+merge/268930

https://code.launchpad.net/~1chb1n/charms/trusty/swift-proxy/amulet-update-1508/+merge/268790

To post a comment you must log in.
Revision history for this message
Adam Collard (adam-collard) wrote :

See https://bugs.launchpad.net/charm-helpers/+bug/1474030/comments/2 for explanation on why this isn't sufficient

review: Needs Fixing
Revision history for this message
Ryan Beisner (1chb1n) wrote :

See comment on the same bug. Thank you.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Since pgrep -p doesn't work on Precise, and we already have another helper to get a process ID, I've refactored _get_proc_start_time to just use that.

Note that that renders pgrep_full a no-op, and I've also updated the calling helpers to ensure the user receives a deprecation WARN about that, while still ensuring that the default use of the refactored _get_proc_start_time does not result in a deprecation WARN.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI, the proposed charm-helpers changes are working well in 2 charms where the race was prevalent, and in 1 charm where the race was seldom seen (keystone):

# openstack-dashboard
https://code.launchpad.net/~1chb1n/charms/trusty/openstack-dashboard/next-amulet-fixup-1507/+merge/268930

# neutron-api
https://code.launchpad.net/~1chb1n/charms/trusty/neutron-api/next-amulet-update/+merge/269302

# keystone
https://code.launchpad.net/~1chb1n/charms/trusty/keystone/amulet-svc-check-race-fix/+merge/269501

Revision history for this message
Adam Collard (adam-collard) :
review: Abstain
Revision history for this message
Ryan Beisner (1chb1n) wrote :

As we are needing to move forward with this and the child MPs, and we have addressed the original concerns, and Adam isn't in a position to be able to fully review at this time, I'm adding Wolsen back for a review.

PS. Thanks to Adam for raising that issue, so that we could propose something even more robust.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

So I've made a few (mostly) editorial comments in-line. I think at the root of whether this should go ahead and go forward or not is based upon whether there exists any implementation dependent upon the pgrep_full behavior. I'm a big +1 on the switch to the use of pidof rather than pgrep because the pgrep implementation allows charm authors to fall into an obscure bug here if they were not careful enough (though admittedly, it'd be due to some lazy coding which could be fixed via being more explicit).

The intention of the pgrep_full flag appears to be to allow amulet test authors to also search script processes rather than just executable names. The -x parameter for pidof allows the same thing to occur. If we define the pgrep_full flag as 'also match script names' rather than the literal interpretation of pgrep_full, then the flag doesn't need to be deprecated.

Most of my in-line comments are based upon the parameter being deprecated. After thinking hard, I'm not entirely sure its required to be deprecated if there is a successful argument in the intent outweighing what is conveyed by the name of the parameter.

review: Needs Fixing
Revision history for this message
Ryan Beisner (1chb1n) wrote :

I totally know what you're saying, this is a tricky one for sure. There are other methods there, intended to check the same thing, but they are also broken in some way, and are basically dying on the vine (and without user warnings).

Deprecation reasoning:

The pgrep_full option will be ill-named after switching to pidof behind the scenes.

We could preserve the parameter and just re-plumb it to pidof's -x, but that would be confusing to consume, and no longer self-descriptive, if not actually misleading.

We could add a new option, such as check_scripts_too, and perhaps just aliased pgrep_full for backward compatibility. That would still be confusing in that the pgrep_full option is out of context, and in that we are inviting user error by exposing that control in the first place.

Which brought me to: can we just get to the core of the intention of the method, and make it smart enough to just do-the-right-thing?: "has a service restarted since epoch time nnnnnnnn.nn?"

ie. my_thing_has_restarted(unit, service, since_epoch_time) It should be that simple, and yet still reliable.

We are already using pidof -x reliably in an existing is_my_service_running helper (-x always on).

And that is the journey of my arrival at: just deprecate it and announce in any case that it is defined.

I did backwards-compatibility spot-checking of this branch by syncing it into multiple charms which previously used the old method with pgrep_full defined (some with True, some with False, some not defined). The proposed refactor and deprecation just worked in all cases that I checked, resolved the race issues, and gave deprecation warnings to boot. So, in that, it reaches my initial goal of fixing this in a way that would resolve those issues with a simple c-h re-sync, without having to modify test code.

I recognize that there is a remote possibility that someone outside our awareness is using that helper function, and this could possibly break them. I think we've minimized that risk, and we've given user feedback appropriately. That's always debatable of course.

So..

What I want to avoid is adding yet-another rewrite of a my_thing_has_restarted helper method to go along with the other 2 that are present, in-use and racy. But actually, it may be a better and cleaner helper if we were to do that (from scratch). Then we could just give YMMV WARNs in the known-problematic ones.

Ideas?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

See replies inline. The more I look over this, and it's ancestors, the more I want to just revert all of this and add a new helper that just works.

Talk me out of doing that?

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Okay, let's go deprecation warnings for the parameter, which is where you went anyway. I probably should not have gone in the direction that I went in with the possibility of not deprecating the parameter. The deprecation warnings serve to notify charm authors but also to allow for this to be a drop-in replacement with as little backwards compatibility issues as possible, so it gets my upvote for that.

Please do NOT add a new option for checking scripts and just use the pidof -x. By deprecating the pgrep_full option we're favoring the desirable approach. If someone can make a really strong use case for adding another flag lets consider adding the flag or something at that point.

I haven't gone through all the ancestry for the method yet, but if we add yet another way then it becomes to confusing as to which method to use. So let's just do the right thing and fix the behavior to work as advertised/expected as this MP is doing.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thanks again for the good questions and replies and advice.

Add'l replies in-line.

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Ah, thanks for the updates Ryan - I think we're now fully on the same page. Just a few minor edits then I guess and it gets my +1.

438. By Ryan Beisner

update comments, pgrep_full conditional

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Thanks for making the changes. LGTM - I added an additional comment for follow up later but it shouldn't hold this MP up in my mind.

review: Approve
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Replied inline. Thanks again!

Revision history for this message
Billy Olsen (billy-olsen) wrote :

One more back at ya. Let's get it merged.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Thanks again, are we ready to land this, so we can also land the related charm test updates?

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Yes, I think this is ready to be landed.

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'charmhelpers/contrib/amulet/utils.py'
2--- charmhelpers/contrib/amulet/utils.py 2015-08-17 10:47:36 +0000
3+++ charmhelpers/contrib/amulet/utils.py 2015-09-02 01:09:48 +0000
4@@ -114,7 +114,7 @@
5 # /!\ DEPRECATION WARNING (beisner):
6 # New and existing tests should be rewritten to use
7 # validate_services_by_name() as it is aware of init systems.
8- self.log.warn('/!\\ DEPRECATION WARNING: use '
9+ self.log.warn('DEPRECATION WARNING: use '
10 'validate_services_by_name instead of validate_services '
11 'due to init system differences.')
12
13@@ -269,33 +269,52 @@
14 """Get last modification time of directory."""
15 return sentry_unit.directory_stat(directory)['mtime']
16
17- def _get_proc_start_time(self, sentry_unit, service, pgrep_full=False):
18- """Get process' start time.
19-
20- Determine start time of the process based on the last modification
21- time of the /proc/pid directory. If pgrep_full is True, the process
22- name is matched against the full command line.
23- """
24- if pgrep_full:
25- cmd = 'pgrep -o -f {}'.format(service)
26- else:
27- cmd = 'pgrep -o {}'.format(service)
28- cmd = cmd + ' | grep -v pgrep || exit 0'
29- cmd_out = sentry_unit.run(cmd)
30- self.log.debug('CMDout: ' + str(cmd_out))
31- if cmd_out[0]:
32- self.log.debug('Pid for %s %s' % (service, str(cmd_out[0])))
33- proc_dir = '/proc/{}'.format(cmd_out[0].strip())
34- return self._get_dir_mtime(sentry_unit, proc_dir)
35+ def _get_proc_start_time(self, sentry_unit, service, pgrep_full=None):
36+ """Get start time of a process based on the last modification time
37+ of the /proc/pid directory.
38+
39+ :sentry_unit: The sentry unit to check for the service on
40+ :service: service name to look for in process table
41+ :pgrep_full: [Deprecated] Use full command line search mode with pgrep
42+ :returns: epoch time of service process start
43+ :param commands: list of bash commands
44+ :param sentry_units: list of sentry unit pointers
45+ :returns: None if successful; Failure message otherwise
46+ """
47+ if pgrep_full is not None:
48+ # /!\ DEPRECATION WARNING (beisner):
49+ # No longer implemented, as pidof is now used instead of pgrep.
50+ # https://bugs.launchpad.net/charm-helpers/+bug/1474030
51+ self.log.warn('DEPRECATION WARNING: pgrep_full bool is no '
52+ 'longer implemented re: lp 1474030.')
53+
54+ pid_list = self.get_process_id_list(sentry_unit, service)
55+ pid = pid_list[0]
56+ proc_dir = '/proc/{}'.format(pid)
57+ self.log.debug('Pid for {} on {}: {}'.format(
58+ service, sentry_unit.info['unit_name'], pid))
59+
60+ return self._get_dir_mtime(sentry_unit, proc_dir)
61
62 def service_restarted(self, sentry_unit, service, filename,
63- pgrep_full=False, sleep_time=20):
64+ pgrep_full=None, sleep_time=20):
65 """Check if service was restarted.
66
67 Compare a service's start time vs a file's last modification time
68 (such as a config file for that service) to determine if the service
69 has been restarted.
70 """
71+ # /!\ DEPRECATION WARNING (beisner):
72+ # This method is prone to races in that no before-time is known.
73+ # Use validate_service_config_changed instead.
74+
75+ # NOTE(beisner) pgrep_full is no longer implemented, as pidof is now
76+ # used instead of pgrep. pgrep_full is still passed through to ensure
77+ # deprecation WARNS. lp1474030
78+ self.log.warn('DEPRECATION WARNING: use '
79+ 'validate_service_config_changed instead of '
80+ 'service_restarted due to known races.')
81+
82 time.sleep(sleep_time)
83 if (self._get_proc_start_time(sentry_unit, service, pgrep_full) >=
84 self._get_file_mtime(sentry_unit, filename)):
85@@ -304,15 +323,15 @@
86 return False
87
88 def service_restarted_since(self, sentry_unit, mtime, service,
89- pgrep_full=False, sleep_time=20,
90- retry_count=2):
91+ pgrep_full=None, sleep_time=20,
92+ retry_count=2, retry_sleep_time=30):
93 """Check if service was been started after a given time.
94
95 Args:
96 sentry_unit (sentry): The sentry unit to check for the service on
97 mtime (float): The epoch time to check against
98 service (string): service name to look for in process table
99- pgrep_full (boolean): Use full command line search mode with pgrep
100+ pgrep_full: [Deprecated] Use full command line search mode with pgrep
101 sleep_time (int): Seconds to sleep before looking for process
102 retry_count (int): If service is not found, how many times to retry
103
104@@ -321,30 +340,44 @@
105 False if service is older than mtime or if service was
106 not found.
107 """
108- self.log.debug('Checking %s restarted since %s' % (service, mtime))
109+ # NOTE(beisner) pgrep_full is no longer implemented, as pidof is now
110+ # used instead of pgrep. pgrep_full is still passed through to ensure
111+ # deprecation WARNS. lp1474030
112+
113+ unit_name = sentry_unit.info['unit_name']
114+ self.log.debug('Checking that %s service restarted since %s on '
115+ '%s' % (service, mtime, unit_name))
116 time.sleep(sleep_time)
117- proc_start_time = self._get_proc_start_time(sentry_unit, service,
118- pgrep_full)
119- while retry_count > 0 and not proc_start_time:
120- self.log.debug('No pid file found for service %s, will retry %i '
121- 'more times' % (service, retry_count))
122- time.sleep(30)
123- proc_start_time = self._get_proc_start_time(sentry_unit, service,
124- pgrep_full)
125- retry_count = retry_count - 1
126+ proc_start_time = None
127+ tries = 0
128+ while tries <= retry_count and not proc_start_time:
129+ try:
130+ proc_start_time = self._get_proc_start_time(sentry_unit,
131+ service,
132+ pgrep_full)
133+ self.log.debug('Attempt {} to get {} proc start time on {} '
134+ 'OK'.format(tries, service, unit_name))
135+ except IOError:
136+ # NOTE(beisner) - race avoidance, proc may not exist yet.
137+ # https://bugs.launchpad.net/charm-helpers/+bug/1474030
138+ self.log.debug('Attempt {} to get {} proc start time on {} '
139+ 'failed'.format(tries, service, unit_name))
140+ time.sleep(retry_sleep_time)
141+ tries += 1
142
143 if not proc_start_time:
144 self.log.warn('No proc start time found, assuming service did '
145 'not start')
146 return False
147 if proc_start_time >= mtime:
148- self.log.debug('proc start time is newer than provided mtime'
149- '(%s >= %s)' % (proc_start_time, mtime))
150+ self.log.debug('Proc start time is newer than provided mtime'
151+ '(%s >= %s) on %s (OK)' % (proc_start_time,
152+ mtime, unit_name))
153 return True
154 else:
155- self.log.warn('proc start time (%s) is older than provided mtime '
156- '(%s), service did not restart' % (proc_start_time,
157- mtime))
158+ self.log.warn('Proc start time (%s) is older than provided mtime '
159+ '(%s) on %s, service did not '
160+ 'restart' % (proc_start_time, mtime, unit_name))
161 return False
162
163 def config_updated_since(self, sentry_unit, filename, mtime,
164@@ -361,12 +394,15 @@
165 bool: True if file was modified more recently than mtime, False if
166 file was modified before mtime,
167 """
168- self.log.debug('Checking %s updated since %s' % (filename, mtime))
169+ self.log.debug('Checking that %s file updated since '
170+ '%s' % (filename, mtime))
171+ unit_name = sentry_unit.info['unit_name']
172 time.sleep(sleep_time)
173 file_mtime = self._get_file_mtime(sentry_unit, filename)
174 if file_mtime >= mtime:
175 self.log.debug('File mtime is newer than provided mtime '
176- '(%s >= %s)' % (file_mtime, mtime))
177+ '(%s >= %s) on %s (OK)' % (file_mtime, mtime,
178+ unit_name))
179 return True
180 else:
181 self.log.warn('File mtime %s is older than provided mtime %s'
182@@ -374,8 +410,9 @@
183 return False
184
185 def validate_service_config_changed(self, sentry_unit, mtime, service,
186- filename, pgrep_full=False,
187- sleep_time=20, retry_count=2):
188+ filename, pgrep_full=None,
189+ sleep_time=20, retry_count=2,
190+ retry_sleep_time=30):
191 """Check service and file were updated after mtime
192
193 Args:
194@@ -383,9 +420,10 @@
195 mtime (float): The epoch time to check against
196 service (string): service name to look for in process table
197 filename (string): The file to check mtime of
198- pgrep_full (boolean): Use full command line search mode with pgrep
199- sleep_time (int): Seconds to sleep before looking for process
200+ pgrep_full: [Deprecated] Use full command line search mode with pgrep
201+ sleep_time (int): Initial sleep in seconds to pass to test helpers
202 retry_count (int): If service is not found, how many times to retry
203+ retry_sleep_time (int): Time in seconds to wait between retries
204
205 Typical Usage:
206 u = OpenStackAmuletUtils(ERROR)
207@@ -402,15 +440,25 @@
208 mtime, False if service is older than mtime or if service was
209 not found or if filename was modified before mtime.
210 """
211- self.log.debug('Checking %s restarted since %s' % (service, mtime))
212- time.sleep(sleep_time)
213- service_restart = self.service_restarted_since(sentry_unit, mtime,
214- service,
215- pgrep_full=pgrep_full,
216- sleep_time=0,
217- retry_count=retry_count)
218- config_update = self.config_updated_since(sentry_unit, filename, mtime,
219- sleep_time=0)
220+
221+ # NOTE(beisner) pgrep_full is no longer implemented, as pidof is now
222+ # used instead of pgrep. pgrep_full is still passed through to ensure
223+ # deprecation WARNS. lp1474030
224+
225+ service_restart = self.service_restarted_since(
226+ sentry_unit, mtime,
227+ service,
228+ pgrep_full=pgrep_full,
229+ sleep_time=sleep_time,
230+ retry_count=retry_count,
231+ retry_sleep_time=retry_sleep_time)
232+
233+ config_update = self.config_updated_since(
234+ sentry_unit,
235+ filename,
236+ mtime,
237+ sleep_time=0)
238+
239 return service_restart and config_update
240
241 def get_sentry_time(self, sentry_unit):

Subscribers

People subscribed via source and target branches