Code review comment for lp:~hazmat/txzookeeper/distributed-queue

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

On Tue, 08 Jun 2010 10:28:24 -0400, Gustavo Niemeyer
<email address hidden> wrote:

> Review: Needs Fixing
> Cool, nice improvements indeed!
>
> I'm going to restart the review again, since it's a brand new
> implementation. I'll start the sequence on 12 just to avoid conflicts.
>
> [12]
>
> + def __init__(self, deferred, watcher):
> + self.deferred = deferred
> + self.watcher = watcher
> + self.processing = False
> + self.refetch = False
>
> I'm having a slightly hard time reading the code because of the
> generality of the names here. E.g. "refetch" what? "watcher" for what?
> "processing" what? Why is "processing" only turned on inside _get_item?
>

Thanks for the review. Noted on the names; I'll update them. FWIW: refetch ->
children, watcher -> queue children, processing -> children.

Re: processing only being set inside of _get_item, I've moved it to _get.
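For clarity, a minimal sketch of what the renamed request state might look like. The names here are hypothetical, following the mapping above; the actual names in the branch may differ:

```python
class GetRequest(object):
    """State for a single queue get() call.

    Hypothetical sketch of the renames discussed above; the actual
    names used in the branch may differ.
    """

    def __init__(self, deferred, child_watcher):
        # Fires with the queue item once one is successfully fetched.
        self.deferred = deferred
        # Watch callback registered on the queue node's children.
        self.child_watcher = child_watcher
        # True while a batch of fetched children is being processed.
        self.processing_children = False
        # Set when the watch fires mid-processing, so the children
        # are refetched once the current batch is exhausted.
        self.refetch_children = False
```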

> [13]
>
> + def on_queue_items_changed(*args):
> + """Event watcher on queue node child events."""
> + if request.complete or not self._client.connected:
> + return
> +
> + if request.processing:
> + request.refetch = True
> + else:
> + self._get(request)
>
> I think I might be missing some detail about the implementation, perhaps
> due to [12].
>
> Let's say that things happen in the following order:
>
> 1) User calls get()
> 2) get() calls _get()
> 3) _get() hooks _get_item on the result, and the watcher above on changes
> 4) Result is returned, and _get_item() is queued for being called
> 5) watcher from (2) fires, and calls _get() again
> 6) Repeat (3)
> 7) Repeat (4)
>
> In other words, it looks like _get_item might end up being queued
> multiple times even before it started running the first time.

Indeed, it's possible. I've updated the code to set processing = True in
_get before the client call, so the watcher can't queue a second fetch
while one is already in flight.

> [14]
>
> I find a bit suspect that we're killing the item from the queue even
> before the user had a chance to know about its existence. This will
> make the queue somewhat prone to items being eaten by crashes.

As it is right now, the queue semantics enforce isolation and minimal
communication with the ZooKeeper server. I noted the requirement regarding
error handling for queue consumers in the module docstring. If we want
reliable message delivery/processing and want the queue to enforce that
semantic for us, then we'll likely need some aggregate/composite node
structures, cooperative semantics, or data introspection/modification on
queue item nodes, all of which will incur additional communication
overhead to enforce that semantic. I think those are effectively different
data structures; a queue isn't necessarily a message queue.
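To make the consumer-side contract concrete: since the item node is deleted before the consumer processes it, a consumer wanting at-least-once semantics has to re-enqueue the item itself on failure. A hedged sketch against a generic get/put queue interface (not the actual txzookeeper API):

```python
def consume(queue, handler):
    """Pull one item and process it, re-enqueueing on failure.

    Sketch of the error-handling requirement placed on consumers;
    `queue` is any object with get()/put(), not the real API.
    """
    item = queue.get()
    try:
        handler(item)
    except Exception:
        # Processing failed; put the item back so it isn't lost.
        queue.put(item)
        raise

class ListQueue(object):
    """Trivial in-memory queue used to exercise the sketch."""
    def __init__(self, items):
        self.items = list(items)
    def get(self):
        return self.items.pop(0)
    def put(self, item):
        self.items.append(item)
```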

> [15]
>
>
> + # Refetching deferred until we process all the children from
> + # from a get children call.
> + if request.refetch:
> + request.processing = False
> + return self._get(request)
>
> Again, going back to [12], I find the nomenclature a bit non-obvious to
> follow. Why is it not "processing" anymore on a refetch? Looks like
> there's a very tight relationship with the get children watcher in this
> logic, even though that's not being made explicit by names nor comments.

That's a bug, since fixed: the request.processing = False should be
outside of and before the conditional. Effectively, whenever there are no
children being processed, request.processing should be false.
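A sketch of the fixed ordering (illustrative names): the flag is cleared unconditionally once the current batch of children is exhausted, and only then is the deferred refetch check made.

```python
class Request(object):
    """Illustrative per-get() state, mid-processing with a pending refetch."""
    def __init__(self):
        self.processing = True
        self.refetch = True

class FakeQueue(object):
    """Stand-in for the queue; records _get invocations."""
    def __init__(self):
        self.gets = 0
    def _get(self, request):
        self.gets += 1

def on_children_exhausted(queue, request):
    # Fix from [15]: clear the flag first, unconditionally; no
    # children are being processed at this point.
    request.processing = False
    # Refetching was deferred until every child from the previous
    # get-children call was processed; do it now if requested.
    if request.refetch:
        request.refetch = False
        return queue._get(request)
```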

>
> Also, I believe there's an issue here. Let's say things happen in this
> order:
>
> 1) children is [], refetch is False, request.processing is True
> 2) on_no_node_error is called on a NoNodeException
> 3) Due to the state, the exception is swallowed and nothing really
> changes
> 4) on_queue_items_changed is called
> 5) Given the state, refetch is made True
> 6) and... nothing else ever happens?
>
> I think making names more explicit might help avoiding issues like
> these, since it'll be easier to reason about the logic there and know
> out of gut feeling when things aren't happening the way they should.

With the fix above, this scenario is no longer valid: when the watch
fires, it restarts the _get.

Thanks!
