On Oct 03, 2012, at 12:34 PM, Robert Bruce Park wrote:

>I actually verified that the OAuth signatures worked for *every* API
>endpoint *except* sending new direct messages (we can receive them ok,
>too).  That's what debug_twitter_live.py was all about; every time I'd
>write a new protocol operation, I'd modify it to do that new
>operation, then run it, and I'd see on my live twitter account that it
>worked.

You may have to engage with the Twitter developer community to debug
this.  I didn't find anything relevant in an admittedly quick bit of
googling, though.  You'd think that if it were a bug in their API,
others would have noticed it.

>I hadn't originally.  At first it just started off as 'from
>gwibber.utils.account import AccountManager; a = AccountManager(None);

Just FWIW, I have a patch in my facebook.working branch that makes None
the default argument.  You have to be careful not to call the callback
if it's None.  I wonder if it's useful to land that in the py3 branch
separately?

>In fact, *all* of the style nits you mention were written by ken ;-)

/me hears Kirk's voice scream to the heavens "Keeeeeeennnnnn!"

>I haven't looked closely at Twitter's rate limiting standards yet, but
>I wrote that comment because I ran into Twitter's limit at one point
>and they stopped me from downloading tweets that I wanted.

I think Twitter warns you (via response headers?) if you're getting
rate limited.  So I guess you could check for those and automatically
back off if you see them.  It's probably better to have at least some
combination of proactive and reactive responses to rate limiting.  The
gory details: https://dev.twitter.com/docs/rate-limiting
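Totally untested sketch of the reactive half, with the header names
guessed from those docs, and a hypothetical `response' object standing
in for whatever the downloader hands back:

    import time

    def _maybe_backoff(response):
        # The X-Rate-Limit-* names are assumptions from the v1.1 docs.
        remaining = int(response.headers.get('X-Rate-Limit-Remaining', 1))
        if remaining == 0:
            # Reset is POSIX seconds; never sleep a negative amount.
            reset = float(response.headers.get('X-Rate-Limit-Reset', 0))
            time.sleep(max(reset - time.time(), 0))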
>>> +        # "Client" == "Consumer" in oauthlib parlance.
>>> +        client_key = self._account.auth.parameters['ConsumerKey']
>>> +        client_secret = self._account.auth.parameters['ConsumerSecret']
>>
>> Can these fail?  Are the other parts of your code prepared to handle
>> the resulting KeyErrors if these parameters are missing?  Do you
>> have any tests for those situations?
>
>No, these are constants defined by Twitter at app-registration time,
>stored in the libaccounts-sso package.  Can never fail.  They may
>change in the event that somebody nefarious compromises them (which
>ought to be easy to do for an open-source package), but they won't
>change during runtime.

Cool.  In that case, the code above is fine, because if they *do*
fail, you want to know via the logged KeyErrors.

>>> +    def _publish_tweet(self, tweet):
>>> +        """Publish a single tweet into the Dee.SharedModel."""
>>> +        tweet_id = tweet.get('id_str', '')
>>
>> Does it still make sense to publish this tweet if there is no
>> id_str?  Won't that possibly result in row collisions, since the
>> only unique item in the key will be the empty string for multiple
>> id_str-less entries?
>>
>> I think in my Facebook branch, I just skip the JSON response entry
>> if there's no message_id.
>
>Yeah, that's probably reasonable.  I just write 'adict.get' out of
>habit for the sake of being as flexible as possible.

As discussed over IRC, something like:

    tweet_id = tweet.get('id_str')
    if tweet_id is None:
        log.info('WTF? no id_str')
        # Skip this entry.
        continue  # or whatever

>>> +# https://dev.twitter.com/docs/api/1.1/get/statuses/mentions_timeline
>>> +    @feature
>>> +    def mentions(self):
>>> +        """Gather the tweets that mention us."""
>>> +        url = self._timeline.format('mentions')
>>> +        for tweet in self._get_url(url):
>>> +            self._publish_tweet(tweet)
>>
>> Do you have tests for each of these?  I know they *look*
>> ridiculously similar, but each should have a test.
>
>I wasn't really sure how to test them, to be honest.

It's probably good enough to use the same, or very similar, sample
data, but just call .mentions(), etc. in the tests.  Really, the only
thing you care about is that you've tested the basic code in this (and
similar) methods, not in anything they call.  Let's say for example
that you - or a future developer - typo'd "self._publish_twit()".
That's the kind of thing you want these tests to catch.

>'test_home' is really an indirect test of _get_url and
>_publish_tweet, and for these individual operations, I wasn't sure
>how to test them without touching the live service.  Or the opposite
>way: a nearly no-op test that mocks out *everything* so aggressively
>that it just tests that it calls the method that it calls.  I guess
>that's what you're looking for, then?  Mock out self._get_url and
>self._publish_tweet and just test that _get_url was called with the
>right URL, and that _publish_tweet was called one or more times?

You probably don't need to mock _publish_tweet(), since you can easily
test the actual results of that by looking at the model.  We already
have mocks that should have the effect of mocking _get_url(), but
yeah, that's the level at which I'd test it.  Maybe also check that
the url gotten was the one you expect.
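Roughly this shape, I mean (the test name, the minimal fake tweet, and
self.protocol are all invented here; reuse whatever fixtures and
attribute names the twitter test module already has):

    @mock.patch('gwibber.utils.base.Model', TestModel)
    @mock.patch('gwibber.utils.base._seen_messages', {})
    @mock.patch('gwibber.utils.base._seen_ids', {})
    def test_mentions(self):
        # Just enough of a tweet for _publish_tweet() to accept it.
        fake_tweet = dict(id_str='1234')
        with mock.patch.object(self.protocol, '_get_url',
                               return_value=[fake_tweet]) as mock_get:
            self.protocol.mentions()
        # The right URL was fetched, and one row got published.
        mock_get.assert_called_once_with(
            self.protocol._timeline.format('mentions'))
        self.assertEqual(TestModel.get_n_rows(), 1)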
>> Also, what happens if get_json() returns an error?  Are these
>> methods prepared to handle problems in communicating with the
>> service?  Of course, they're going to be run in a sub-thread, but is
>> there something better you can do than let the sub-thread exceptions
>> just fill up the log file?  Can you maybe catch exceptions and print
>> something more useful to the log file?
>
>Well, I forget the exact error code, but in the event that we hit the
>rate limit, get_json will raise an exception with an appropriate HTTP
>error code and a meaningful string along with it, such as 'rate limit
>exceeded' or something similar.  I didn't really see any issue with
>that just bubbling straight up into the log file.  No need to
>duplicate the same error-checking code directly in every single
>method.

Okay.

>> Also, this seems like quite a common idiom:
>>
>>     for tweet in self._get_url(url):
>>         self._publish_tweet(tweet)
>>
>> Is there any opportunity for refactoring some commonality?
>
>Originally I had written the method as _publish_tweets (plural), and
>it was built in with the ability to iterate over multiple tweets.
>That worked for a good long while, but then there were just a couple
>of API endpoints that returned only a single tweet object rather than
>a list of tweets, so I needed to refactor the method to handle both
>cases, and this is what I came up with.

Cool.

>If you really want, we can have something like this:
>
>    def _publish_tweets(self, url):
>        for tweet in self._get_url(url):
>            self._publish_tweet(tweet)
>
>But I just thought it was too simple to even bother.

Yeah, I think what you have is fine.  With the additional tests
described above, you should be good to go.

>>> +    @feature
>>> +    def receive(self):
>>> +        """Gather and publish all incoming messages."""
>>> +        # TODO I know mentions and lists are actually incorporated
>>> +        # within the home timeline, but calling them explicitely will

Oh, btw.  s/explicitely/explicitly/

>>> +        # ensure that more of those types of messages appear in the
>>> +        # timeline (eg, so they don't get drown out by everything
>>> +        # else). I'm not sure how necessary it is, though.
>>> +        self.home()
>>> +        self.mentions()
>>> +        self.lists()
>>> +        self.private()
>>
>> What happens if you get an exception in one of these streams?  Does
>> it make sense to still try to publish the other streams?
>
>What kind of exceptions are you suspecting there to be?

Well, let's say for example that Twitter has an outage, inconveniently
timed for when you call .lists(), but by the time you call .private()
it would succeed.  Does it make sense that the db will have the
results for .home() and .mentions() but that's it?  Maybe that's okay,
and the error will just get logged, and next time around it will all
sync up.  OTOH, maybe you want all or nothing, or maybe you'd want as
much as will succeed.  I think once you decide, a comment would
explain the rationale for the future reader.
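E.g. if you decide "as much as will succeed" is the right policy,
something like this (untested; assumes the module-level log object
your other methods already use):

    @feature
    def receive(self):
        """Gather and publish all incoming messages."""
        for stream in (self.home, self.mentions, self.lists,
                       self.private):
            try:
                stream()
            except Exception:
                # One stream failing shouldn't starve the others; the
                # next receive() will pick up whatever we missed.
                log.exception('%s stream failed', stream.__name__)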
>One problem is that quantal gwibber is using exclusively Twitter API
>v1, which is deprecated and will be going away actually quite soon
>(good timing for this rewrite!).  My code is all using the newer API
>v1.1, so it's not always helpful to look at what quantal gwibber
>does, because the API isn't always the same.

So it could be a bug in v1.1 of the API, not present in v1.0, and it's
possible that it either hasn't been reported yet, or we haven't found
the bug report.

>>> +# https://dev.twitter.com/docs/api/1.1/post/statuses/destroy/%3Aid
>>> +    @feature
>>> +    def delete(self, message_id):
>>> +        url = self._api_base.format(endpoint='statuses/destroy/%s' % message_id)
>>
>> Any reason you're using % interpolation here?  Perhaps either use {}
>> and .format(), or just concatenation (e.g. 'statuses/destroy/' +
>> message_id)?
>
>Yeah, it looks really ugly to me when I see
>'string{}'.format('foo{}'.format('bar')), but I also wanted to allow
>message_id to be an int rather than requiring it to be a string,
>which is what + concatenation would have done.
>
>% interpolation just made the int/str conversion easy without being
>too verbose.

Does it make sense to be even more strict about the int-ness of
message_id?  The other option, of course, is to add more class
attributes with endpoints like the above, so you'd have something
like:

    url = self._status_destroy.format(message_id)

>>>     @mock.patch('gwibber.utils.base.Model', TestModel)
>>>     @mock.patch('gwibber.utils.base._seen_messages', {})
>>> +    @mock.patch('gwibber.utils.base._seen_ids', {})
>>
>> Do you really need to mock both _seen_messages and _seen_ids?
>> What's the difference between these two?  Can they ever get out of
>> sync?  Do we need additional tests to ensure that the constraints of
>> these two data structures are always correct?  Do we need additional
>> tests for when they contain cache hits?
>
>Sorry, I was in a rush when I wrote this.  Base._publish now adds to
>both of these, so it seemed safer to mock both for consistency, but
>it's not clear to me right off the top of my head whether it's
>strictly necessary for the test itself.

Easy to find out. :)

>>>     def tearDown(self):
>>> +        # Ensure that any log entries we haven't tested just get consumed so
>>> +        # as to isolate our test logger from other tests.
>>>         self.log_mock.stop()
>>
>> The diff doesn't show it, but does setUp() set Base._SYNCHRONIZE =
>> True?
>
>No, I don't need the synchronize stuff because I write the tests to
>just test the synchronous methods directly.  I never invoke the
>thread-starting __call__ in the tests.

Good point.

>> === modified file 'gwibber/gwibber/utils/base.py'
>> --- gwibber/gwibber/utils/base.py  2012-09-25 20:25:42 +0000
>> +++ gwibber/gwibber/utils/base.py  2012-10-02 14:33:58 +0000
>>> @@ -42,6 +42,7 @@
>>>  # representing the rows matching those keys. It is used for quickly finding
>>>  # duplicates when we want to insert new rows into the model.
>>>  _seen_messages = {}
>>> +_seen_ids = {}
>>
>> What is _seen_ids for, and how is it used differently than
>> _seen_messages?  Do we still need both of these?  If so, can you
>> adjust the comment to explain what these are for?
>
>_seen_ids maps message ids to rowiters.  It was the only way I could
>think of to allow deletion of messages without horribly iterating
>over the whole SharedModel.  At delete-time, we only have access to
>the message_id, not the sender & message data that would be required
>to look up the rowiter in _seen_messages.

I think, but you'd have to verify it, that _seen_messages doesn't need
to be a mapping then.  It can be just a set().  IOW, you don't need to
map the extended ids to rowiters, just the message_ids to rowiters.
So the checks would be something like:

    key = _make_key(args)
    if key in _seen_messages:
        row = Model.get_row(_seen_ids[message_id])
        row[IDS_IDX] = row[IDS_IDX] + args[IDS_IDX]
    else:
        _seen_ids[message_id] = Model.append(*args)
        _seen_messages.add(key)

>>> +    def _unpublish(self, message_id):
>>> +        """Remove message_id from the Dee.SharedModel."""
>>> +        triple = [self.__class__.__name__.lower(),
>>> +                  self._account.id,
>>> +                  message_id]
>>> +
>>> +        row_iter = _seen_ids.pop(message_id)
>>> +        if row_iter is None:
>>> +            log.error('Tried to delete an invalid message id.')
>>> +            return
>>
>> Have you tested the 'miss' logic?  What happens when you .pop() with
>> a missing key?  (This is a trick question. :)
>
>Well, I haven't specified a default, which means it will raise
>KeyError if the id is not present.  This doesn't particularly bother
>me, because a) the threading logic will safely log and ignore the
>exception, and b) this can only be invoked by clicking on the delete
>button on a message that's being displayed in gwibber, so the
>message_id being valid is almost *tautological*: if message_id wasn't
>valid, then how the hell did this method even get invoked?

Right, but the point is, can row_iter ever actually be None?  As you
say, if you pop a missing message_id, you'll get a KeyError instead of
a None return.  Is it even possible to insert a message_id into
_seen_ids that has None instead of an actual rowiter?  Put another
way, no test probably reaches this code, and a coverage tool would
show that.  So, is the test and logging necessary?  If not, and you
end up removing it, please do add a comment above the .pop() that
explains the rationale for not checking for KeyErrors; otherwise
Future Reader will be confused.

>> What happens if the .pop() does work?  Does anything need to happen
>> to _seen_messages to keep it in sync?  Isn't the row associated with
>> the row_iter going to be removed?  If so, what happens to the same
>> row_iter sitting as a value in the _seen_messages mapping?
>
>Right, they would become out of sync.  But only insofar as there will
>be a sender/message key pointing at a rowiter that points at a row
>that doesn't exist.  I'm not certain that this is particularly
>harmful, beyond being a minor memory leak.
>
>I suppose in some extreme case, if a person deletes a message and
>then tries to submit an identical message, it would cause us to
>attempt to append that message onto a row that doesn't exist.  I
>don't really see that being a useful real-world scenario, because in
>my mind the use-cases for deleting a message would be to either a)
>leave it deleted, or b) submit a *different* message in its place.
>
>Anyway, do you know of an easy way to do a dict.pop() that pops based
>on values instead of keys?  Or are we going to have to scan the whole
>dict in an inefficient way?  Or is there maybe some kind of alternate
>data structure we should be looking at?  Some kind of two-way dict
>where both keys and values are hashed and indexed?

My concern is that if the two data structures get out of sync, this
will lead to hard-to-debug errors.  It's a wart asking for trouble.
What I'm thinking is something like this instead, building on my above
idea that _seen_messages would be a set, and the observation that
rowiters can never be None:

    # (triple would need to be hashable, e.g. a tuple, to be usable
    # as a set member.)
    _seen_messages.discard(tuple(triple))
    row_iter = _seen_ids.pop(message_id, None)
    if row_iter is None:
        # same as it ever was

>> === modified file 'gwibber/gwibber/utils/download.py'
>> --- gwibber/gwibber/utils/download.py  2012-09-14 20:29:11 +0000
>> +++ gwibber/gwibber/utils/download.py  2012-10-02 14:33:58 +0000
>>> @@ -51,11 +51,11 @@
>>>          url = self.url
>>>          headers = self.headers.copy()
>>>          if self.params is not None:
>>> -            params = urlencode(self.params)
>>> +            params = urlencode(self.params).replace('+', '%20')
>>
>> Why do this?  (Yes, I know you explained it to me, but pretend I'm a
>> first-time reader of the code.  Explain it to them! :)
>
>Because urlencode does not provide any option to use quote() instead
>of quote_plus(), which it enforces, but Twitter requires
>percent-encoded spaces, and this is harmless to any other protocol.
>Also, it's easier than re-implementing urlencode from scratch.
>Hooray for code reuse!

Right!  It was a trick question, since I really just want that
paragraph, or something very much like it, in a comment above the
code, to help our dear Future Reader.
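IOW, something like this (your wording, lightly edited into a
comment):

    if self.params is not None:
        # urlencode() offers no way to use quote() instead of
        # quote_plus(), but Twitter requires percent-encoded spaces
        # in signed requests, and %20 is harmless to every other
        # protocol.  This also beats reimplementing urlencode().
        params = urlencode(self.params).replace('+', '%20')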
>>>          if self.post:
>>>              data = params.encode('utf-8')
>>>              headers['Content-Type'] = (
>>> -                'application/x-www-form-urlencoded; charset=utf-8')
>>> +                'application/x-www-form-urlencoded')
>>
>> Why do this too?  This one you haven't explained even to me! :)
>
>It's because oauthlib is incredibly brittle.  It scans for precisely
>the string 'application/x-www-form-urlencoded' and raises an
>exception if it's not exactly that.  So it seems this was necessary
>in order to get OAuth signatures to even function.
>
>On second thought, however, it's not clear to me what oauthlib
>actually *does* with that header when it's signing (I'm pretty sure
>the value of that header is not included in the base string that gets
>signed).  So it might be worthwhile to undo this change and then see
>if anything breaks.  This was just one of those things that I did
>back when I was really struggling to get any signature working at
>all, and then after I got it working, I kept it because a) I forgot
>that I'd done it, and b) it didn't occur to me until just now that it
>might not have been necessary to make the signatures work.
>
>So in other words, I'll test this ;-)

Excellent!
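FWIW, you might be able to answer the charset question without a live
round trip, with a quick interpreter session along these lines (I'm
going from memory on the oauthlib API here, so adjust as needed):

    from oauthlib.oauth1 import Client

    client = Client('consumer-key', client_secret='consumer-secret')
    for ctype in ('application/x-www-form-urlencoded',
                  'application/x-www-form-urlencoded; charset=utf-8'):
        try:
            # sign() returns (uri, headers, body) when it's happy.
            client.sign('https://example.com/', http_method='POST',
                        body='status=hello%20world',
                        headers={'Content-Type': ctype})
            print(ctype, '-> signs fine')
        except ValueError as error:
            print(ctype, '->', error)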