Deep conversation about error handling in Ensemble, in the context of item [4] above. We've agreed to merge this as-is, since clearly there's a lot of work to be done for the desired end result.

Jul 05 17:30:51 Let me understand this.. are you introducing this logic so that the application will use this to crash the agent?
Jul 05 17:31:14 If that's the case, as we've already agreed, that's ok for now
Jul 05 17:31:15 that's one approach
Jul 05 17:31:36 the other is letting the agent stop/start
Jul 05 17:31:47 I just want to understand what we're doing, though, and what would be the proper way
Jul 05 17:31:53 i know.. there's dead stuff on the reactor if we start/stop
Jul 05 17:31:58 Not just that..
Jul 05 17:32:10 The dead stuff is not my major worry
Jul 05 17:32:21 The real problem I see is that we simply can't trust any code anymore
Jul 05 17:32:33 Imagine.. "Oh, I'll loop until file foo is created."
Jul 05 17:32:34 NOT!
Jul 05 17:32:43 The loop never dies..
Jul 05 17:32:55 *anything* we do would be entirely unreliable
Jul 05 17:33:01 if it's touching a watch
Jul 05 17:33:13 It's like piecemeal crashing..
Jul 05 17:33:36 right, error handling is application specific
Jul 05 17:34:23 with the existing watch protocols, it's not an issue, except for the case of a hook currently executing (which needs cleanup even in the case of a process shutdown)
Jul 05 17:35:11 I'm pretty sure I can find cases where it is an issue..
Jul 05 17:35:21 relation-get ip
Jul 05 17:35:35 relation-get goes into the agent..
Jul 05 17:35:45 Never returns again..
Jul 05 17:35:52 Dead processes
Jul 05 17:36:05 that's the case i just mentioned
Jul 05 17:36:22 I see
Jul 05 17:36:34 I read "watch currently executed" when you said hook
Jul 05 17:36:56 Anyway.. that's the kind of issue
Jul 05 17:37:07 There are likely more problematic cases already
Jul 05 17:38:58 yup.. the interesting case to me would be something where we're 'interactively' waiting on a watch, which we don't have; all the existing watch apis are effectively complete on their own (ie. request/response cycles).. if we want to consider arbitrary code.. sure, we can construct problem scenarios.. but the existing usage i think is okay..
Jul 05 17:39:00 * hazmat_ brainstorms
Jul 05 17:39:24 Well, it's already not ok, as we have just both spotted
Jul 05 17:40:13 well the hook scenario is very different from the dead-logic-waiting issue
Jul 05 17:40:19 Why?
Jul 05 17:40:33 it doesn't really care if the session terminates, it just needs an active connection to zk
Jul 05 17:41:06 Hmmm
Jul 05 17:43:27 pretty much all of the watch protocols/callbacks are based on examining the current state, and then effecting changes for their area of responsibility to match that current state.. if they're restarted, they function the same way... i don't know that we have any partial state transition that isn't resumable
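
The "loop until file foo is created" problem above, sketched concretely. This is a minimal illustration, not Ensemble code: it assumes a Twisted/inlineCallbacks agent and a txzookeeper-style exists_and_watch that returns a pair of deferreds (existence first, watch second); wait_for_file is a hypothetical example.

    from twisted.internet import defer

    @defer.inlineCallbacks
    def wait_for_file(client, path="/foo"):
        # Naive intent from the discussion: loop until `path` exists.
        while True:
            exists_d, watch_d = client.exists_and_watch(path)
            stat = yield exists_d          # process the current state first
            if stat is not None:
                defer.returnValue(stat)
            # If the session expires here, watch_d never fires -- no callback,
            # no errback -- so this loop, and anything chained after
            # wait_for_file, is stranded as dead logic on the reactor.
            yield watch_d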
Jul 05 17:43:47 the only one i can think of is the upgrade case, which clears the flag before it performs the upgrade
Jul 05 17:44:22 i agree though that restarting is probably the safest thing we can do for now
Jul 05 17:48:02 Yeah, I don't think we have an option if we're not handling local errors
Jul 05 17:48:19 i feel sad i haven't gotten to the watch protocol stuff, it would make decentralizing the error handling and setting up per-watch cleanups a bit more formal
Jul 05 17:48:59 it's not clear to me how well we can do that even as-is though with just session events
Jul 05 17:49:11 It's just an error
Jul 05 17:49:34 We should keep in mind that errors may happen on every operation that interacts with a remote system
Jul 05 17:49:40 (and on many that don't, actually)
Jul 05 17:49:59 well we have two types of errors, api errors, and session level event errors
Jul 05 17:50:15 This is what Steve Vinoski blogs with so much anger about regarding RPC systems
Jul 05 17:50:25 People (we) simply forget that it may fail
Jul 05 17:50:31 the former is directly in response to an api request, and the latter is async to the api call
Jul 05 17:50:52 niemeyer: indeed.. it's interesting as well that we don't get session expiration errors till we reconnect to zk
Jul 05 17:51:28 If/when we move to Go, I really want to get this right
Jul 05 17:51:33 i think being state based we're already worlds better than the standard workflow of rpc systems
Jul 05 17:51:34 Which is one reason I'm banging on it now
Jul 05 17:51:48 We have to understand how we _should_ have done it
Jul 05 17:51:57 Agreed
Jul 05 17:52:02 well i noticed gozk keeps a central registry of extant watches, which is useful.. and perhaps more importantly it's using channels here
Jul 05 17:52:12 Right
Jul 05 17:52:25 which allow for multiple response values, so we can stream back the session events directly to the watch logic
Jul 05 17:52:28 I'm not entirely sure I'm handling session events in a manageable way, though
Jul 05 17:52:40 whereas in twisted we really only get one chance to fire the deferred
Jul 05 17:52:48 So we should really figure out the long term plan and jointly try to get it right next time
Jul 05 17:53:00 Ah, good point
Jul 05 17:53:06 That'll make things much easier
Jul 05 17:53:11 select will also help
Jul 05 17:53:13 select {...}
Jul 05 17:53:25 and synchronicity too
Jul 05 17:53:41 does_it_exist_now := <-exists_watch
Jul 05 17:53:57 Similar to yield
Jul 05 17:54:12 niemeyer: so in twisted we have two options, we can leave it be, accumulate dead logic in the reactor and do a process restart, or we can propagate fatal errors to the dead logic, effectively expiring them
Jul 05 17:54:23 I don't mind the dead logic
Jul 05 17:54:29 In the deferred, specifically
Jul 05 17:54:33 i don't either, the numbers are minimal
Jul 05 17:54:38 This is not a frequent event (or won't be, for now)
Jul 05 17:54:52 I mind the fact we'll have completely unpredictable behavior _after_ the logic is dead
Jul 05 17:56:16 So, I'm happy with pushing [4] as-is for now, and crashing it hard in case of session errors, and attempting to restart the whole process
Jul 05 17:56:17 niemeyer: that's what i want an example of in the current impl.. the only scenario i could think of is the upgrade case, which we've already thought about and which has a well defined way for the end user to 'redo' the upgrade
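
The second Twisted option above (propagating fatal errors to the dead logic, with a gozk-style central registry of outstanding watches) could look roughly like this. WatchRegistry and SessionError are hypothetical names for illustration, not txzookeeper APIs:

    from twisted.internet import defer

    class SessionError(Exception):
        """Raised into every pending watch when the session expires."""

    class WatchRegistry(object):
        def __init__(self):
            self._pending = set()

        def watch(self):
            """Return a deferred that fires with the watch event, or fails on expiry."""
            d = defer.Deferred()
            self._pending.add(d)
            d.addBoth(self._forget, d)
            return d

        def _forget(self, result, d):
            self._pending.discard(d)
            return result

        def session_expired(self):
            # Expire every outstanding watch instead of leaving dead logic
            # waiting forever on the reactor.
            for d in list(self._pending):
                d.errback(SessionError("session expired"))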
Jul 05 17:56:59 Look for every call path that may ever lead into a function that calls *_and_watch
Jul 05 17:57:02 niemeyer: i'm just not certain we can solve this at the connection level, we need application support for the error recovery, it's app specific error recovery
Jul 05 17:57:12 and then consider everything it has running before reaching that stage
Jul 05 17:57:21 and what would happen if it simply stopped working
Jul 05 17:57:51 Sorry, can you give me 5 mins
Jul 05 17:57:55 niemeyer: sure
Jul 05 17:57:57 The guy that will fix the leaking pipe is here
Jul 05 17:58:10 biab
Jul 05 18:01:58 niemeyer: so afaics, considering our existing watch entry points, it should be fine. the agent can start/stop and effectively restart the system.. the reason i prefer start/stop is that currently we have some in-memory state that's not persisted to disk in the unit agent.. the hook scheduler specifically.. if we want that to operate correctly without missing events, we need to keep on-disk state that we can diff and reduce against zk to replay any missed events. the effective api that we export to hooks is an event based one, even if internally we are state based.
Jul 05 18:02:30 do you mind if i paste this conversation and post back to the channel for interested parties? (when you get back)
Jul 05 18:18:43 Yo
Jul 05 18:18:47 He's fixing it..
Jul 05 18:18:52 We may have water today still ;-)
Jul 05 18:20:01 Yeah, exactly..
Jul 05 18:20:27 The problem is that you want the state, but it's _some_ state, not all of it
Jul 05 18:20:46 I have no idea about what a stop/start will do
Jul 05 18:23:21 it will reconsider the state, it's equiv. to a start/stop
Jul 05 18:23:21 In the sense that stop may never actually be processed correctly depending on which paths it went through
Jul 05 18:23:39 er. equiv to a process restart
Jul 05 18:24:41 we don't have partial state modifications, so i don't see how it matters what paths it went down
Jul 05 18:25:24 if we consider the current state, we should always be able to enforce the app behavior against it
Jul 05 18:26:30 I don't understand what you mean by that.. I'm concerned about the logic that is executing mid-way through and being discontinued abruptly
Jul 05 18:27:02 The current state is, well, not current anymore
Jul 05 18:30:46 the state changes sure, but we're always operating against the state we can see at the moment, which i'm referencing as the current state
Jul 05 18:31:19 Yeah, but I don't really understand your perspective on this..
Jul 05 18:31:30 If we reestablish the session, we can't simply go on
Jul 05 18:31:40 It's not just about saving state to disk or not
Jul 05 18:31:44 no we can't, we have to logically restart
Jul 05 18:32:08 we have to re-examine the current state, set watches, and effect behavior against the current state
Jul 05 18:32:19 relation-changed, etc, have to be called
Jul 05 18:32:24 So what's the point?
Jul 05 18:32:32 We have an uncertain state in memory
Jul 05 18:32:55 it's the same thing that was happening before with the application watch apis, we need to consider the first time the watch fires as reflecting the current state of the watch.. ie. exists_and_watch should process the exists
Jul 05 18:33:05 deferred first as reflecting the current state
Jul 05 18:33:34 Yeah, so it's a crash, but we just didn't restart?
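
The on-disk scheduler state mentioned at 18:01:58 could be as simple as the following sketch: persist whatever the formula was last told about, so a restarted unit agent can diff it against zk and replay missed events. SchedulerState and its file layout are assumptions for illustration, not the actual unit agent code:

    import os
    import yaml

    class SchedulerState(object):
        """Last state delivered to the formula hooks, persisted to disk."""

        def __init__(self, path):
            self._path = path

        def load(self):
            # Return the last delivered state, or an empty default on first run.
            if not os.path.exists(self._path):
                return {"members": [], "config": {}}
            with open(self._path) as f:
                return yaml.safe_load(f)

        def save(self, state):
            # Write-and-rename so a crash mid-write never leaves a corrupt file.
            tmp = self._path + ".tmp"
            with open(tmp, "w") as f:
                yaml.safe_dump(state, f)
            os.rename(tmp, self._path)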
Jul 05 18:34:29 niemeyer: well effectively we are.. we could actually do a restart, except for the in-memory state reflective of what we've already told the formula about
Jul 05 18:35:09 How is that state interesting?
Jul 05 18:35:29 It's out of date, and wrong.. because there are changes upstream that got dropped
Jul 05 18:35:42 niemeyer: it reflects which state changes/events we've already told the formula hooks about
Jul 05 18:36:09 I understand that
Jul 05 18:36:12 You said so
Jul 05 18:36:32 But I'm asking how is that state that we told the formula about interesting when in fact it's out of date, and we don't know how it's out of date
Jul 05 18:37:10 because it allows us to tell the formula about things it hasn't seen when they happen
Jul 05 18:37:50 imagine a formula gets 3 relation joins, and then we disconnect, and while we're disconnected we get two relation-departs.. when we reconnect, we want to be able to tell the formula about those departs
Jul 05 18:38:33 at some point we'll be able to access the current state, and we want to reconcile the two to paint a consistent picture/universe for the formula
Jul 05 18:38:34 What about the lost event
Jul 05 18:38:40 E.g. config changed
Jul 05 18:38:52 E.g. relation changed
Jul 05 18:39:31 for those we need to capture additional local state of what the data was, so we can do the same
Jul 05 18:39:51 ie if we show config-changed, we should capture the config, and ditto for relation data
Jul 05 18:40:01 and what if the application actually crashes?
Jul 05 18:40:04 Then we lost those
Jul 05 18:40:15 not if we're capturing them to disk
Jul 05 18:40:21 Well.. precisely
Jul 05 18:40:25 what if the disk crashes ;-)
Jul 05 18:40:28 Nah..
Jul 05 18:40:30 I'm serious
Jul 05 18:40:59 The point I'm trying to get to is that it's a very hackish approach to render all local logic unable to recover from errors
Jul 05 18:41:28 and then say that we'll do a hard-core stop/start in-memory just because we want to access some data that should actually be on disk
Jul 05 18:42:20 I'd vote for either:
Jul 05 18:42:56 there's a bug to add persistence to the scheduler, and i agree that would be ideal and would allow for unit agent restarts without concern
Jul 05 18:42:58 A) Let's handle local errors properly, as every application should, and do our best to recover from real hard crashes
Jul 05 18:43:01 or
Jul 05 18:43:26 B) Let's crash, assume we haven't thought about that properly enough at the time, and have a plan on how to get out of it
Jul 05 18:43:53 so it's not always possible to solve these problems locally, without taking into account the global state
Jul 05 18:44:24 Hmm, ok, I don't get that
Jul 05 18:44:28 How is it so?
Jul 05 18:45:40 so say the session expires, we get a stream of these expiration events, and watch/local specific error handlers invoked for each.. does one of them reconnect, do they all reconnect, do they depend on each other, etc
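
The join/depart reconciliation in the 18:37:50 example (three joins delivered, two departs missed while disconnected) boils down to a set difference between the membership last delivered to the formula and the membership currently visible in zk. A hypothetical helper, purely to make the arithmetic concrete:

    def reconcile_membership(last_seen, current):
        """Return (departed, joined) unit lists, in a stable order."""
        last, cur = set(last_seen), set(current)
        departed = sorted(last - cur)
        joined = sorted(cur - last)
        return departed, joined

    # Formula saw three joins, then two units departed while we were down:
    # reconcile_membership(["u/0", "u/1", "u/2"], ["u/2"])
    #   -> (["u/0", "u/1"], [])
    # so relation-departed can still be fired for u/0 and u/1 on reconnect.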
Jul 05 18:45:59 None of them should reconnect
Jul 05 18:46:05 The reconnection should be a global decision
Jul 05 18:46:14 The local one can decide to keep trying, or to bail out with an error
Jul 05 18:46:22 Depending on whatever it's doing
Jul 05 18:47:14 If it keeps trying, it should be able to timeout and/or accept an external "stop" event, which still gives it a chance to handle the situation properly
Jul 05 18:48:20 Note that this approach keeps local logic alive and in control of its own fate, and enables the system to recover itself
Jul 05 18:48:25 We don't even have to start/stop
Jul 05 18:48:37 Or stop/start
Jul 05 18:49:04 well we still need lifecycle management, the external 'stop' event, for things like clean shutdown, testing, etc
Jul 05 18:49:38 Right, and that's certainly something we should support
Jul 05 18:49:43 niemeyer: so this goes into what i was considering with the protocols, but that requires a more formal notion of how we structure our watches
Jul 05 18:50:16 with that we can more readily apply local error handling without having to inline it in the invoker, because it's encapsulated
Jul 05 18:50:43 It requires a formal notion of how to handle errors, which is something we haven't been doing very well across the board.
Jul 05 18:50:48 Note stuff like this:
Jul 05 18:51:02 # XXX In the future we'll have to think about formula
Jul 05 18:51:02 # replacements here. For now this will do, and will
Jul 05 18:51:02 # explode reliably in case of conflicts.
Jul 05 18:51:02 yield self._client.create("/formulas/%s" % formula_id,
Jul 05 18:51:02                           yaml.safe_dump(formula_data))
Jul 05 18:51:02 formula_state = FormulaState(self._client, namespace,
Jul 05 18:51:02                              name, revision, formula_data)
Jul 05 18:51:35 "Exploding reliably" is kind of a euphemism
Jul 05 18:51:44 It explodes, and possibly in multiple ways
Jul 05 18:51:59 Now, look for things that call add_formula_state
Jul 05 18:52:09 and evaluate how the error has been handled there
Jul 05 18:52:29 hmm.. this is a different problem to me
Jul 05 18:52:48 It's the same underlying problem: we haven't been handling errors very well
Jul 05 18:53:00 and I'll blow it out a bit more: Python code in general doesn't
Jul 05 18:53:04 well what happens in this case, if the create fails
Jul 05 18:53:22 nothing, an error goes back up the app stack, in this case to the cli, and the user can try again
Jul 05 18:53:37 there is no partial state modification
Jul 05 18:53:39 This is not a problem in graphical applications, or console applications, or even servers that only have their own state to worry about
Jul 05 18:53:52 We're in a more sensitive environment, though
Jul 05 18:54:16 Ok, i picked a bad example
Jul 05 18:55:27 content, stat = yield self._client.get("/topology")
Jul 05 18:55:43 That's _read_topology..
Jul 05 18:55:52 The only error it catches is NoNodeException
Jul 05 18:56:42 Now imagine all the things that can go wrong in this call, and all the code that depends on this logic
Jul 05 18:58:07 The attitude "I will step out reliably because I can't continue what I'm doing" is simply not part of our usual way of thinking
Jul 05 18:59:19 other problems besides NoNodeException are global problems
Jul 05 18:59:21 You're right that many of those errors can't be dealt with locally. The issue comes when the local guy isn't careful to clean up after itself, and the outside guy doesn't even know the local guy can blow up badly
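
To make the local/global split above concrete: _read_topology-style code could treat the missing node as the one locally meaningful error and let everything else (connection loss, session expiration, ...) propagate to whoever owns the connection. A sketch only, not the Ensemble implementation; exception names assume the zkpython binding:

    import yaml
    import zookeeper
    from twisted.internet import defer

    @defer.inlineCallbacks
    def read_topology(client):
        try:
            content, stat = yield client.get("/topology")
        except zookeeper.NoNodeException:
            # Locally meaningful: a fresh deployment simply has no topology yet.
            defer.returnValue({})
        # Anything else is a global problem: let it propagate to the
        # connection-level handler instead of guessing at a recovery here.
        defer.returnValue(yaml.safe_load(content))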
Jul 05 18:59:32 actually security issues are local as well
Jul 05 19:01:14 niemeyer: agreed, a local guy who has partial state modification needs local cleanup.
Jul 05 19:01:31 where state could be on disk, in zk, managing external processes, etc
Jul 05 19:01:36 Yeah
Jul 05 19:01:50 and it needs to give a clear hint to the outside guy "Hey, things can go wrong for me! Watch out!"
Jul 05 19:02:41 So that the outside guy can do the same kind of thing that the local guy does when cleaning up.
Jul 05 19:02:55 and has the chance to handle, or to bail out again to the outermost handler
Jul 05 19:03:24 niemeyer: one question is do we have any of those local actors atm
Jul 05 19:03:37 and the second is how do we enable that cleanup
Jul 05 19:03:52 The first thing is to put error handling under control
Jul 05 19:04:08 if it's a recoverable error, then talking to the local actor about the problem exhausts our communication channel to it
Jul 05 19:04:11 and I'm not sure we can do that in Python, to be honest
Jul 05 19:04:20 that's why i'm distinguishing between the two
Jul 05 19:04:35 niemeyer: we can, we just need to formalize our notion of these actors a bit more
Jul 05 19:05:02 and we can introduce additional communication patterns beyond just the simple deferred approach we're using now
Jul 05 19:05:03 It's tricky because it's not part of the culture.. the example I just gave on read_topology is a classic one
Jul 05 19:05:10 but again those are problems solved at the application level
Jul 05 19:05:12 None of us even pondered it at the time
Jul 05 19:05:18 not at the connection level that we're dealing with in txzk
Jul 05 19:06:48 i don't think read_topology is the issue, the basis of the state/watch protocols is being able to encompass a set of application logic associated with a behavior, with lifecycle semantics and error recovery
Jul 05 19:07:21 but again we're mixing up the issue at hand, which is what we want to do with txzk, and what we want to do in the application
Jul 05 19:07:28 read_topology is not even about watching
Jul 05 19:08:25 niemeyer: sure, but any of our interactions with zk can be classified as the implementation of one of a set of application behaviors
Jul 05 19:09:05 Erm.. that was a NOOP in my brain :-D
Jul 05 19:09:14 doh
Jul 05 19:10:26 so i'm saying putting X different error handlers on every zk interaction isn't necessary... that zk operation is part of a sequence, and that sequence represents a behavior, and it's really at the behavior level that we can best implement the error handler
Jul 05 19:10:51 Hmmm, I see.. let me think
Jul 05 19:14:08 i need to get some food on the table.. back in 15m
Jul 05 19:14:17 Cool, see you soon
Jul 05 19:14:18 more like 10
Jul 05 19:14:41 Let's evaluate a use case
Jul 05 19:14:59 I think that'll lead us to a good idea of that stuff
Jul 05 19:24:05 back
Jul 05 19:35:26 so pretty much any of the existing unit agent behaviors would work as a use case: resolved, upgrade, relation hook execution (which is probably the most complex)
Jul 05 19:36:03 the core of these things is: get current state and set a watch, associate a callback to process state
Jul 05 19:36:10 perhaps that's too generic
Jul 05 19:40:18 Hmm
Jul 05 19:42:50 niemeyer: would you be up for picking this up again tomorrow?
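
A rough shape of the behavior-level error handling being proposed above: one wrapper per unit-agent behavior (resolved, upgrade, relation hook execution), owning its lifecycle and recovery, rather than an inline handler on every individual zk call. This is a hypothetical sketch, not an existing Ensemble class:

    from twisted.internet import defer

    class Behavior(object):
        """One unit of application behavior with lifecycle and error recovery."""

        def start(self):
            # Run the behavior's zk operation sequence under one error handler.
            d = defer.maybeDeferred(self.run)
            d.addErrback(self.on_error)
            return d

        def stop(self):
            """External lifecycle hook: clean shutdown, tests, etc."""

        def run(self):
            # Subclasses implement: get current state, set watch, process state.
            raise NotImplementedError

        def on_error(self, failure):
            # Behavior-specific recovery: undo partial state, reschedule, or
            # let the failure propagate so the agent-level handler can decide
            # whether to restart the whole process.
            self.stop()
            return failure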
Jul 05 19:42:52 I'm looking at the provisioning agent
Jul 05 19:43:04 * hazmat_ pulls up a copy
Jul 05 19:43:41 Yeah, we assume working communication across the board
Jul 05 19:45:39 if not stat:
Jul 05 19:45:39     environment = yield watch_d
Jul 05 19:45:41 This would halt
Jul 05 19:46:45 Which doesn't look like a big issue in this one case, but again.. it's just a coincidence that there's nothing relevant being done after it
Jul 05 19:47:09 niemeyer: the key to finding problems, to me, is something that's dealing with partial state modification
Jul 05 19:47:42 Well, to me it's a conceptual problem as well
Jul 05 19:47:52 fair enough
Jul 05 19:47:58 It's just not a sane way to develop reliable software
Jul 05 19:48:20 We have to constantly keep in mind that some operations may halt execution partially
Jul 05 19:50:13 but regarding approaching it as a set of behaviors/protocols that have recovery at the behavior level?
Jul 05 19:51:41 i guess rephrasing.. what is the appropriate granularity of error recovery, is it really at the operation level, or can it be encompassed in a known operation sequence
Jul 05 19:54:01 i agree that conceptually 'hoping it works' is not a scalable (either code size or team size) approach to reliability
Jul 05 19:54:24 nor is periodic analysis of the problem
Jul 05 19:54:31 against the codebase
Jul 05 19:54:41 The appropriate level is all of them
Jul 05 19:54:57 In the sense that errors have to be considered and propagated
Jul 05 19:55:31 Part of our problem is that it's extremely convenient to code as if things just worked
Jul 05 19:55:55 Want a hint of that?
Jul 05 19:56:16 We have zero retries in our code base, besides the ssh stuff, which is there because it actually failed in our face
Jul 05 19:56:40 We have to be much more diligent about that, IMO
Jul 05 19:59:26 Take configure_environment, for instance
Jul 05 19:59:37 There's nothing there which needs state
Jul 05 19:59:42 Erm, sorry
Jul 05 20:00:02 Which needs session continuity
Jul 05 20:00:19 We can break down, and just keep trying
Jul 05 20:06:06 Should we continue tomorrow? :)
Jul 05 20:09:14 niemeyer: sounds good
Jul 05 20:09:26 That was a nice catch-up, actually
Jul 05 20:09:27 Thanks
Jul 05 20:09:36 good stuff
Jul 05 20:09:47 I'm happy for that branch to move on, no matter what
Jul 05 20:10:19 It's clear that we won't solve the total problem without major work
Jul 05 20:10:29 cool, just mark it approved and i'll proceed with the merge
Jul 05 20:10:45 i put it back into the review queue, so we could make sure we're on the same page re [4]
Jul 05 20:10:59 i think we're agreed that the implications need additional discussion and application level support
Jul 05 20:11:14 we can revisit the connection handling if we decide we need different behavior out of it
Jul 05 20:11:24 Agreed
Jul 05 20:11:44 more importantly, it solves the transient network hiccup (non-fatal session events) that clint was seeing now
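
A minimal sketch of the "just keep trying" idea for operations like configure_environment that need no session continuity: retry a deferred-returning operation a few times with a delay between attempts. The retry helper below is hypothetical, not an existing utility in the codebase:

    from twisted.internet import defer, reactor
    from twisted.internet.task import deferLater

    @defer.inlineCallbacks
    def retry(operation, attempts=5, delay=2.0):
        """Call `operation()` until it succeeds or `attempts` are exhausted."""
        for attempt in range(attempts):
            try:
                result = yield operation()
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of retries: propagate to the caller
                # Wait a bit before trying again.
                yield deferLater(reactor, delay, lambda: None)
            else:
                defer.returnValue(result)

    # e.g. yield retry(lambda: configure_environment(client, config))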