I'm going to review the latest and greatest version of the branch.

First of all, this is fantastic! You've done a great job of removing tons of crufty, legacy, mysterious, horrible old code and replacing it with just a little bit of good, new, readable code. This is a huge win. Just to show you how big, here's the diffstat:

    service/messages.py                     |   47 --
    service/searches.py                     |   87 ---
    service/streams.py                      |   93 ----
    storage/messages.py                     |  172 -------
    storage/searches.py                     |   63 --
    storage/streams.py                      |   68 ---
    testing/com.Gwibber.Messages.service.in |    3
    testing/com.Gwibber.Searches.service.in |    3
    testing/com.Gwibber.Streams.service.in  |    3
    testing/dbus.py                         |    3
    tests/test_model.py                     |   39 +
    tests/test_protocols.py                 |   37 +
    tests/test_storage.py                   |  688 -------------------------------
    utils/base.py                           |   57 ++
    utils/model.py                          |  100 ++++
    15 files changed, 231 insertions(+), 1232 deletions(-)

You gotta love it when you remove 5x more code than you add, *and* make it so much more understandable.

One thing I noticed was lots of skipped tests when running under `make check_all` or a virtualenv. As we discussed on IRC, you're going to fix those, and I'll review what you've got so far. I'll skip the deletions and any uncontroversial new code.
=== modified file 'gwibber/gwibber/tests/test_protocols.py'
--- gwibber/gwibber/tests/test_protocols.py	2012-09-13 17:50:35 +0000
+++ gwibber/gwibber/tests/test_protocols.py	2012-09-18 20:37:18 +0000

> @@ -27,6 +27,7 @@
>  from gwibber.protocols.twitter import Twitter
>  from gwibber.utils.base import Base
>  from gwibber.utils.manager import ProtocolManager
> +from gwibber.utils.model import SharedModel
>
>
>  class TestProtocolManager(unittest.TestCase):

> @@ -126,6 +127,42 @@
>          thread.join()
>          self.assertEqual(my_protocol.result, 'one:two')
>
> +    def test_publish(self):
> +        base = Base(None)
> +        with self.assertRaises(TypeError):
> +            base._publish(invalid_argument='not good')

I don't think you need the context manager version of this, since you're not checking anything about the raised exception. You could possibly check details about the exception, e.g. the bad column name (more on this later), but if not, then I think this would be fine:

    self.assertRaises(TypeError, base._publish, invalid_argument='not good')

One other thing we discussed on IRC is that these tests are not well isolated from the environment. Because they are not running inside TestDBus, they actually pollute the user's session. Besides the problem that running the tests would show bogus data in the Gwibber UI, the model doesn't get reset after every test, so I bet it persists between test runs. Which means it's not clear this is actually testing what you think it's testing.

The solution is to move all the _publish() tests to TestDBus, since that ensures they run on a temporary, isolated session bus, and thus, as Ken says, with their own SharedModel. If TestDBus gets too unwieldy as a result, we'll have to find a way to run multiple TestCases under the temporary session bus. (It should be doable, but maybe not so easily.)

> +
> +        # First message gets through.
> +        self.assertTrue(base._publish(
> +            service='facebook', account_id='1/facebook',
> +            message_id='1234', stream='messages', sender='fred',
> +            sender_nick='freddy', from_me=True, timestamp='today',
> +            message='hello, @jimmy', likes=10, liked=True))
> +
> +        # freddy's duplicate message to Twitter is accepted.
> +        self.assertTrue(base._publish(
> +            service='twitter', account_id='2/twitter',
> +            message_id='5678', stream='messages', sender='fred',
> +            sender_nick='freddy', from_me=True, timestamp='today',
> +            message='hello jimmy', likes=10, liked=False))
> +
> +        # But there is only one message in the model.
> +        self.assertEqual(1, SharedModel.get_n_rows())
> +        self.assertEqual(SharedModel.get_row(SharedModel.get_first_iter())[7],
> +                         'hello, @jimmy')
> +        self.assertEqual(SharedModel.get_row(SharedModel.get_first_iter())[0],
> +                         [['facebook', '1/facebook', '1234'],
> +                          ['twitter', '2/twitter', '5678']])
> +
> +        # Tedtholomew is allowed to say the same thing in his message though
> +        self.assertTrue(base._publish(
> +            service='facebook', account_id='1/facebook',
> +            message_id='34567', stream='messages', sender='tedtholomew',
> +            sender_nick='teddy', from_me=False, timestamp='today',
> +            message='hello, @jimmy', likes=10, liked=True))
> +
> +        self.assertEqual(2, SharedModel.get_n_rows())
> +

Of course, you know what I'm going to say about these tests. :) I think these should all be in separate test_*() methods. Long tests like this are usually easier to write, but much more difficult to debug when they fail. I generally think it's better to have smaller, more isolated tests with fewer dependencies on earlier results. Then, when the tests fail later, it's easier to reason about what could be going wrong.

For example, one test would assert that the data is missing from the table, then try to _publish() it, asserting a True return value, then assert that the data exists in the table.
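To sketch what I mean (everything here is illustrative: FakeBase, the method names, and the plain list standing in for the isolated SharedModel you'd get under TestDBus are all mine):

```python
import unittest


class FakeBase:
    """A stand-in for Base, just enough to illustrate the test split."""
    def __init__(self, model):
        self.model = model

    def _publish(self, sender, message):
        # Append only if this (sender, message) pair is new; always
        # report success, mirroring the semantics under review.
        row = (sender, message)
        if row not in self.model:
            self.model.append(row)
        return True


class TestPublish(unittest.TestCase):
    """Each test starts from an empty model and checks one behavior."""

    def setUp(self):
        self.model = []
        self.base = FakeBase(self.model)

    def test_publish_new_message(self):
        # The table starts empty; publishing appends exactly one row.
        self.assertEqual(len(self.model), 0)
        self.assertTrue(self.base._publish('fred', 'hello'))
        self.assertEqual(len(self.model), 1)

    def test_publish_duplicate_not_appended(self):
        # Publishing the same data twice still returns True, but must
        # not add a second row.
        self.assertTrue(self.base._publish('fred', 'hello'))
        self.assertTrue(self.base._publish('fred', 'hello'))
        self.assertEqual(len(self.model), 1)


if __name__ == '__main__':
    unittest.main()
```

Each method stands alone, so a failure points directly at the one behavior that broke.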
A second test would be very similar, except that it would try to insert the exact same data, also assert a True return value, but then show that only one row of the data is in the table.

A possible third test would use the same key as the original row, but different non-key data. You'd then show that the second _publish() didn't overwrite the data of the first.

There are probably variations of the above with separate threads that should get written, but that will be much trickier, so I think it would be okay to TODO/XXX them for later.

>      def test_basic_login(self):
>          # Try to log in twice.  The second login attempt returns False because
>          # it's already logged in.

=== modified file 'gwibber/gwibber/utils/base.py'
--- gwibber/gwibber/utils/base.py	2012-09-13 17:50:35 +0000
+++ gwibber/gwibber/utils/base.py	2012-09-18 20:37:18 +0000

> @@ -20,8 +20,25 @@
>      ]
>
>
> -from threading import Thread
> -
> +import re
> +import string
> +
> +from threading import Thread, Lock
> +
> +from gwibber.utils.model import SharedModel, COLUMN_NAMES
> +
> +
> +regex_sub = re.compile('[{}\s]'.format(re.escape(string.punctuation))).sub

I think you just impaled some kitten on a rainbow unicorn's horn. I have no idea what the purpose of this is. What is this regexp trying to do? Why is it needed? My guess is that you want to remove some punctuation from something, but I have no idea why you'd want to do that. Plus, one-liners like this are always head-scratchers: "um, oh wait! you're inserting some escaped characters into a regexp character set with some trailing whitespace".

> +
> +def strip(message):
> +    """Return a string suitable for fuzzy comparisons."""
> +    message = message.replace('http://', '')
> +    message = message.replace('https://', '')
> +    message = message.replace('gwibber:/', '')
> +    return regex_sub('', message)

This adds to the confusion.
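For the record, here's my reading of what that regexp and strip() actually do, as a standalone sketch (reconstructed from the diff, not tested against the branch):

```python
import re
import string

# string.punctuation is every ASCII punctuation character; escaping it
# and dropping it into a character class alongside \s means "match any
# single punctuation or whitespace character".
regex_sub = re.compile(r'[{}\s]'.format(re.escape(string.punctuation))).sub


def strip(message):
    """Reconstruction of the helper: drop URL scheme prefixes, then
    remove all punctuation and whitespace."""
    for prefix in ('http://', 'https://', 'gwibber:/'):
        message = message.replace(prefix, '')
    return regex_sub('', message)


# Two superficially different messages collapse to the same string,
# which is presumably the point of the "fuzzy comparison".
print(strip('hello, @jimmy'))   # hellojimmy
print(strip('hello jimmy'))     # hellojimmy
```

If that reading is right, a comment saying as much would save the next reader a lot of head-scratching.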
I don't know why "fuzzy comparisons" are important here, or why the transformations help make the message argument more suitable for them. Imagine yourself coming to this code cold: would you understand its purpose? Pretend you're reverse engineering it all; what explanation would you provide about what's going on that you wish were there to save you a day of confusion? :)

I think the purpose of all this is to provide you with a unique key that you could use to detect duplicate rows. Am I right? If so, then maybe also rename strip() to something more related to its purpose, rather than its implementation.

> +
> +
> +publish_lock = Lock()

Since this is private to the module, _publish_lock is probably a better name.

>
>
>  class Base:
>      class info:

> @@ -53,6 +70,42 @@
>          thread.daemon = True
>          thread.start()
>
> +    def _publish(self, service, account_id, message_id, **kwargs):

This means that you can never have columns called 'service', 'account_id', or 'message_id', right? Do you think that will ever be a problem in practice?

The fundamental issue is that you've got a namespace collision. This is a not uncommon pattern, and there are various ways out of it.

First, you can try to choose argument names that *really* will never be used as column names. An example might be names that start with an underscore (that's probably a good enough rule to be sure to follow in libgwibber).

Another thing you can do is reduce the number of possibly offending arguments. E.g. you could combine those three elements into a 3-tuple argument such as this:

    def _publish(self, _key, **kwargs):
        """...

        :param _key: A 3-tuple of the format (service, account_id,
            message_id) which represents the key into a row of the
            table.  This argument is passed as a 3-tuple so as to
            avoid possible future name clashes with column names.
        """
        service, account_id, message_id = _key

> +        """Publish fresh data into the stream, without duplicating anything.
> +
> +        Returns True if the message was appended, or if the message was
> +        already present, returns False if unable to append new messagee.

s/messagee/messages/

> +        """
> +        message = COLUMN_NAMES.index('message')
> +        sender = COLUMN_NAMES.index('sender')
> +        ids = COLUMN_NAMES.index('message_ids')

These are constants, right? Perhaps you can calculate them in model.py and just import them here? Perhaps with names that are a little more evocative of what they are, e.g. MESSAGE_COLUMN or somesuch? Or, see below for another suggestion.

> +
> +        args = [''] * len(COLUMN_NAMES)
> +        args[ids] = [[service, account_id, message_id]]

Since this will be confusing to someone when they first read this, you might refer them to the comment in model.py about why this column is a nested array.

> +
> +        # The basic idea is that you call _publish() with keyword
> +        # arguments only, making the calling code a bit more explicit,
> +        # and then this converts it from an unordered dict into the
> +        # ordered list required by SharedModel.append().
> +        for column_name, value in kwargs.items():
> +            try:
> +                args[COLUMN_NAMES.index(column_name)] = value

.index() is actually not a cheap call to make, because it actually has to do a linear search of the sequence every time. Since this is all constant anyway, you could calculate one more object in model.py and import it here. Create a mapping from column name to index like so:

    COLUMN_INDICES = {name: i for i, name in enumerate(COLUMN_NAMES)}

Now, here, and in the code above, all you would need is:

    bad_kwargs = []
    for column_name, value in kwargs.items():
        index = COLUMN_INDICES.get(column_name)
        if index is None:
            bad_kwargs.append(column_name)
        else:
            args[index] = value
    if len(bad_kwargs) > 0:
        raise TypeError('Unexpected keyword arguments: {}'.format(
            COMMASPACE.join(bad_kwargs)))

Why do I do it like that? So that if more than one bad keyword argument is given, you'll see them all the first time you hit this code.
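To make that concrete, here's a self-contained sketch of the pattern (COLUMN_NAMES is abbreviated, and I'm assuming COMMASPACE is just ', '):

```python
COLUMN_NAMES = ('message_ids', 'stream', 'sender', 'message')
COLUMN_INDICES = {name: i for i, name in enumerate(COLUMN_NAMES)}
COMMASPACE = ', '


def fill_args(**kwargs):
    """Map keyword arguments onto positional columns, collecting *all*
    unknown names before raising, so every mistake is reported at once."""
    args = [''] * len(COLUMN_NAMES)
    bad_kwargs = []
    for column_name, value in kwargs.items():
        # A dict lookup instead of list.index(); constant time, and
        # .get() returns None rather than raising on unknown names.
        index = COLUMN_INDICES.get(column_name)
        if index is None:
            bad_kwargs.append(column_name)
        else:
            args[index] = value
    if len(bad_kwargs) > 0:
        raise TypeError('Unexpected keyword arguments: {}'.format(
            COMMASPACE.join(sorted(bad_kwargs))))
    return args


print(fill_args(sender='fred', message='hi'))
# ['', '', 'fred', 'hi']

try:
    fill_args(bogus=1, sender='fred', wrong=2)
except TypeError as error:
    print(error)
# Unexpected keyword arguments: bogus, wrong
```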
I'm going to make one other suggestion about how to do all this, and I'll let you decide which you like.

    args = []
    for column_name, type_code in SCHEMA:
        # This one is special because blah blah blah.
        if column_name == 'message_ids':
            args.append([[service, account_id, message_id]])
        else:
            # If not given, the column's value is an empty string.
            args.append(kwargs.pop(column_name, ''))
    # All keyword arguments must correspond to column names, so any
    # leftover arguments are not allowed.
    if len(kwargs) > 0:
        raise TypeError('Unexpected keyword arguments: {}'.format(
            COMMASPACE.join(kwargs)))

{}.pop() is kind of nice, isn't it? :)

> +            except ValueError:
> +                raise TypeError(
> +                    "_publish() got an unexpected keyword argument '{}'".format(
> +                        column_name))
> +
> +        with publish_lock:
> +            # Don't let duplicate messages into the model, but do
> +            # record the unique message ids of each duplicate message.
> +            stripped = strip(args[sender] + args[message])
> +            for row in SharedModel:
> +                if stripped == strip(row[sender] + row[message]):
> +                    row[ids] = row[ids] + args[ids]
> +                    return True

This code just confuses the hell out of me. I had to re-read it several times, and I'm still not sure why the concatenation occurs. Is that tested?

Maybe one way to make it more intelligible would be to rename strip() to unique_key() and pass in the entire row. Then you might see something like:

    new_key = unique_key(args)
    for row in SharedModel:
        if new_key == unique_key(row):
            # There is a row collision.  Discard the remaining data,
            # but append the new message ids so that the front end can
            # display all the sources of this data.
            row[MESSAGE_IDS_INDEX].extend(args[MESSAGE_IDS_INDEX])

Also, is there no more efficient way of asking whether the key already exists in the database than iterating through all the rows? Couldn't that get extremely inefficient?

> +
> +        return bool(SharedModel.append(*args))

What does SharedModel.append() actually return?
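On the efficiency point, one option (sketched here with a plain list standing in for SharedModel, and with hypothetical names throughout; I don't know what Dee itself offers for lookups) is to keep a side dict mapping each row's key to its position, so duplicate detection is a constant-time lookup instead of a full scan:

```python
# Hypothetical column layout for the sketch: [message_ids, sender, message]
IDS_COLUMN = 0
SENDER_COLUMN = 1
MESSAGE_COLUMN = 2

_key_to_row = {}    # unique key -> index of the row in the model
model = []          # stand-in for SharedModel


def unique_key(sender, message):
    """Stand-in fuzzy key; the real one would strip URLs and punctuation."""
    return ''.join(ch for ch in (sender + message).lower() if ch.isalnum())


def publish(message_ids, sender, message):
    """Append a row unless its key already exists; on a collision just
    merge the new message ids into the existing row."""
    key = unique_key(sender, message)
    existing = _key_to_row.get(key)
    if existing is not None:
        model[existing][IDS_COLUMN].extend(message_ids)
        return True
    _key_to_row[key] = len(model)
    model.append([list(message_ids), sender, message])
    return True


publish([['facebook', '1/facebook', '1234']], 'fred', 'hello, @jimmy')
publish([['twitter', '2/twitter', '5678']], 'fred', 'hello jimmy')
print(len(model))            # 1
print(model[0][IDS_COLUMN])  # both id triples, merged into one row
```

The trade-off is keeping the side index in sync with the model, which matters if rows can also be deleted.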
> +
>      def _login(self):
>          """Prevent redundant login attempts."""
>          # The first time the user logs in, we expect old_token to be None.

=== added file 'gwibber/gwibber/utils/model.py'
--- gwibber/gwibber/utils/model.py	1970-01-01 00:00:00 +0000
+++ gwibber/gwibber/utils/model.py	2012-09-18 20:37:18 +0000

> @@ -0,0 +1,100 @@
> +# Copyright (C) 2012 Canonical Ltd
> +#
> +# This program is free software: you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License version 2 as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program.  If not, see .
> +
> +"""The Dee.SharedModel interface layer.
> +
> +Dee.SharedModel is comparable to a Gtk.ListStore, except that it
> +shares its state between processes using DBus.  When gwibber-service
> +downloads new messages from a website, it inserts those messages into
> +this SharedModel instance, which then triggers a callback in the Vala
> +frontend, which knows to display the new messages there.
> +"""
> +
> +
> +__all__ = [
> +    'SharedModel',
> +    'COLUMN_NAMES',
> +    'COLUMN_TYPES',
> +    ]
> +
> +
> +from gi.repository import Dee
> +
> +
> +# Most of this schema is very straightforward, but the 'service'

I think you mean s/service/message_ids/ right?

> +# column needs a bit of explanation:
> +#
> +# It is a two-dimensional array (ie, an array of arrays).  Each inner
> +# array contains three elements: the name of the protocol, the
> +# account_id (like '6/flickr' or '3/facebook'), followed by the
> +# message_id for that particular service.
> +#
> +# Then, there will be one of these triples present for every service
> +# on which the message exists.
> +# So for example, if the user posts the same message to both facebook
> +# and twitter, that message will appear as a single row in this
> +# schema, and the 'service' column will look something like this:
> +#
> +#  [
> +#   ['facebook', '2/facebook', '12345'],
> +#   ['twitter', '3/twitter', '987654'],
> +#  ]

Could you add an explanation of *why* the inner one is an array? I vaguely recall you and Ken discussing that this could be a hash of the concatenated values. A comment describing why this design was chosen would be helpful, I think.

> +SCHEMA = (
> +    ('message_ids', 'aas'),
> +    ('stream', 's'),
> +    ('transient', 's'),
> +    ('sender', 's'),
> +    ('sender_nick', 's'),
> +    ('from_me', 'b'),
> +    ('timestamp', 's'),
> +    ('message', 's'),
> +    ('html', 's'),
> +    ('icon_uri', 's'),
> +    ('url', 's'),
> +    ('source', 's'),
> +    ('timestring', 's'),
> +    ('reply_nick', 's'),
> +    ('reply_name', 's'),
> +    ('reply_url', 's'),
> +    ('likes', 'd'),
> +    ('liked', 'b'),
> +    ('retweet_nick', 's'),
> +    ('retweet_name', 's'),
> +    ('retweet_id', 's'),
> +    ('link_picture', 's'),
> +    ('link_name', 's'),
> +    ('link_url', 's'),
> +    ('link_desc', 's'),
> +    ('link_caption', 's'),
> +    ('link_icon', 's'),
> +    ('img_url', 's'),
> +    ('img_src', 's'),
> +    ('img_thumb', 's'),
> +    ('img_name', 's'),
> +    ('video_pic', 's'),
> +    ('video_src', 's'),
> +    ('video_url', 's'),
> +    ('video_name', 's'),
> +    ('comments', 's'),
> +    ('recipient', 's'),
> +    ('recipient_nick', 's'),
> +    ('recipient_icon', 's'),
> +    )
> +
> +
> +COLUMN_NAMES, COLUMN_TYPES = zip(*SCHEMA)
> +
> +
> +SharedModel = Dee.SharedModel.new('com.Gwibber.Streams')
> +SharedModel.set_schema_full(COLUMN_TYPES)
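One small aside for future readers: the zip(*SCHEMA) trick at the end is nice but can look like magic the first time. Here's what it does, with an abbreviated copy of the schema:

```python
# An abbreviated copy of the schema: (column name, DBus type code) pairs.
SCHEMA = (
    ('message_ids', 'aas'),
    ('stream', 's'),
    ('sender', 's'),
)

# zip(*SCHEMA) transposes the pairs into two parallel tuples, so the
# names and the type codes both derive from the single SCHEMA source.
COLUMN_NAMES, COLUMN_TYPES = zip(*SCHEMA)
print(COLUMN_NAMES)   # ('message_ids', 'stream', 'sender')
print(COLUMN_TYPES)   # ('aas', 's', 's')
```

A one-line comment to that effect in model.py wouldn't hurt.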