On 12-09-18 05:16 PM, Barry Warsaw wrote:
> First of all, this is fantastic! You've done a great job at removing tons of
> crufty, legacy, mysterious, horrible old code and replaced it with just a
> little bit of good, new, readable code. This is a huge win.

Thanks. I can't take all the credit though... somebody had to write all this terrible code in the first place! Lots of low-hanging fruit on this project ;-)

> Just to show you how big, here's the diffstat:
[snip]
> 15 files changed, 231 insertions(+), 1232 deletions(-)

This is actually an old diffstat! The final tally was 1550 lines deleted ;-)

> You gotta love it when you remove 5x more code than you add, *and* make it so
> much more understandable.

Heh. It was 10x more, but then I expanded the tests... ;-)

> One thing I noticed was lots of skipped tests when running under `make
> check_all` or a virtualenv. As we discussed on IRC, you're going to fix
> those, and I'll review what you've got so far.

Yep, that's already fixed and pushed.

>> +    def test_publish(self):
>> +        base = Base(None)
>> +        with self.assertRaises(TypeError):
>> +            base._publish(invalid_argument='not good')
>
> I don't think you need the context manager version of this, since you're not
> checking anything about the raised exception. You could possibly check
> details about the exception, e.g. the bad column name (more on this later),
> but if not, then I think this would be fine:
>
>     self.assertRaises(TypeError, base._publish, invalid_argument='not good')

Fair. I had forgotten about your proposed syntax and was thinking that the with statement was the only way to test for exceptions. You can change this if you feel strongly about it, but I'm indifferent on this point.

> One other thing we discussed on IRC is that these tests are not well isolated
> from the environment. Because these are not running inside TestDBus, they
> actually pollute the user's session. Other than the problem that running the
> test would show bogus data in the Gwibber UI, the other problem is that the
> model doesn't get reset after every test, so I bet it persists between test
> runs. Which means it's not clear this is actually testing what you think it's
> testing.

Yes, I know. It's not clear to me exactly how this will fit into TestDBus, though, because SharedModel is not like our other DBus classes, where we define the classes ourselves and have a lot of control over them. I have no idea whether SharedModel will respect the custom session bus that TestDBus creates, or how to ensure that that is happening; I was hoping you'd offer a bit more guidance on this point. Also, on some level we don't need to test SharedModel itself, just the code that we use to interface with it. To that end, we might almost be better off leaving it outside of TestDBus, simply creating a new SharedModel that uses a different bus name, and then mocking that in place of the 'real' SharedModel. At least as far as testing Base._publish goes, anyway. I'll play with this some more tonight, I suppose.

> The solution is to move all the _publish() tests to TestDBus, since that
> ensures it's running on a temporary, isolated session bus, and thus as Ken
> says, it should have its own SharedModel.

How does TestDBus set up this custom session bus? Is it done in such a way that importing SharedModel will force it to use the custom bus and not the real bus? Because SharedModel handles its own dbus logic internally, I don't see any way to tell SharedModel 'hey, use this other bus instead' unless your TestDBus magic is somehow affecting the dbus module itself in a process-global way.
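To make the different-bus-name idea concrete, here's roughly what I was picturing (just a sketch: the import paths and the test bus name are made up, and it assumes unittest.mock or the standalone mock package):

    from unittest import mock
    from gi.repository import Dee

    from gwibber.utils.model import COLUMN_TYPES  # hypothetical path

    # A throwaway model on its own bus name, so the tests never touch
    # the user's real 'com.Gwibber.Streams' model.
    test_model = Dee.SharedModel.new('com.Gwibber.Streams.Testing')
    test_model.set_schema_full(COLUMN_TYPES)

    # Patch wherever SharedModel is actually imported in base.py.
    with mock.patch('gwibber.utils.base.SharedModel', test_model):
        # Exercise Base._publish() here; any rows it appends land in
        # test_model instead of the user's session.
        ...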
> If TestDBus gets too unwieldy as a result, we'll have to find a way to run
> multiple TestCases under the temporary session bus. (It should be doable, but
> maybe not so easily.)

TestDBus was decimated by the code removals, so a small addition should still result in a net decrease in size. I wouldn't worry about that so much at this point.

> I think these should all be in separate test_*() methods. Long tests like
> this are usually easier to write, but much more difficult to debug when they
> fail. I generally think it's better to have smaller, more isolated tests with
> fewer dependencies on earlier results. Then, when the tests fail later, it's
> easier to reason about what could be going wrong.
>
> For example, one test would assert that the data is missing from the table,
> then try to _publish() it, asserting a True return value, then assert that the
> data existed in the table.

You're right, I did forget to write a test that asserts that the model starts off empty.

>> +regex_sub = re.compile('[{}\s]'.format(re.escape(string.punctuation))).sub
>
> I think you just impaled some kitten on a rainbow unicorn's horn.

Oh Barry! Regex is not our enemy!

> I have no idea what the purpose of this is. What is this regexp trying to do?

It defines a basic matcher that says 'match any punctuation character, and whitespace too'. I was hoping the tests would do a better job of demonstrating what this is for. Did you notice that the second 'message' that was rejected as a duplicate actually had different punctuation than the first one that was accepted? This regex is used in the following function to strip all punctuation and spaces out of a string.

> Why is it needed? My guess is that you want to remove some punctuation from
> something, but I have no idea why you'd want to do that. Plus, one liners
> like this are always head-scratchers: "um, oh wait! you're inserting some
> escaped characters into a regexp character set with some trailing
> whitespace".
>
>> +
>> +def strip(message):
>> +    """Return a string suitable for fuzzy comparisons."""
>> +    message = message.replace('http://', '')
>> +    message = message.replace('https://', '')
>> +    message = message.replace('gwibber:/', '')
>> +    return regex_sub('', message)
>
> This adds to the confusion. I don't know why "fuzzy comparisons" are
> important here, or why the transformations help make the message argument more
> suitable for them.

Hey man, this is still better than the code that it replaced. The 'message.replace' lines I don't fully understand myself, but they were insisted upon by kenvandine, because that's what libgwibber currently does. The idea is that we don't want duplicate messages to show up in the stream of messages that are visible to the user. But different social media sites attach different semantic meanings to different punctuation marks, so we want to ignore punctuation when determining whether one message is actually identical to another. So we strip it out for the sake of comparing the strings. Consider this example: Fred uses Gwibber to post identical messages on Twitter and Google+ (pretend that we support G+ for a moment). Fred writes "Hey jimbob, been to http://example.com lately?"; this message might show up on Twitter as "Hey @jimbob, been to example.com lately?", but on G+ as "Hey +jimbob, been to http://example.com lately?". So we need to strip out all the possibly-different bits in order to recognize that these messages are exactly the same. In both cases above, the string is converted into "Heyjimbobbeentoexamplecomlately", and then the two compare equal, so we've identified a duplicate message.
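To see it in action, here is the branch's code again, followed by a quick check of the Fred example (the asserts are mine):

    import re
    import string

    # Match any single punctuation or whitespace character.
    regex_sub = re.compile('[{}\s]'.format(re.escape(string.punctuation))).sub

    def strip(message):
        """Return a string suitable for fuzzy comparisons."""
        message = message.replace('http://', '')
        message = message.replace('https://', '')
        message = message.replace('gwibber:/', '')
        return regex_sub('', message)

    # Both renderings of Fred's message collapse to the same string:
    assert (strip('Hey @jimbob, been to example.com lately?') ==
            strip('Hey +jimbob, been to http://example.com lately?') ==
            'Heyjimbobbeentoexamplecomlately')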
Fred writes "Hey jimbob, been to http://example.com lately?", and this message might show up on Twitter like "Hey @jimbob, been to example.com lately?", but it might show up on G+ like "Hey +jimbob, been to http://example.com lately?". So we need to strip out all the possibly-different bits in order to identify that these messages are exactly the same. So in both those cases above, the string is converted into "Heyjimbobbeentoexamplecomlately" and then they compare equally, so we've identified a duplicate message. > Imagine yourself coming on this code cold, would you understand its purpose? > Pretend you're reverse engineering it all; what explanation would you provide > about what's going on that you wish were there to save you a day of > confusion? :) Ok, so it needs some better comments explaining it, but it had to be done with regex for performance reasons. Using a list comprehension to build a string was shown to be 3x slower[1], and this function gets executed once for each message in the stream (so it is O(n)). Just wait until there are a lot of messages present, and then you'll be sorry ;-) [1] http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python/266162#266162 > I think the purpose of all this is to provide you with a unique key that you > could use to detect duplicate rows. Am I right? If so, then maybe also > rename strip() to something more related to its purpose, rather than its > implementation. You got it exactly right! It's not so hard to read :-P If you feel it's really that hard to read, we can change it to the list comprehension idiom instead (see 'test_set' in the above link), but not without doing some kind of cache/memoizer to stop it having to run so many times. Actually strip() is a good candidate for memoization because it is called repeatedly with the same output, doesn't depend on any outside environment other than what is passed to it, and really does need to be fast (otherwise there will be big delays trying to add new messages to the stream). >> +publish_lock = Lock() > > Since this is private to the module, _publish_lock is probably a better name. Ok. >> + def _publish(self, service, account_id, message_id, **kwargs): > > This means that you can never have columns called 'service', 'account_id', or > 'message_id', right? Do you think that will ever be a problem in practice? Nope, I discussed this with kenvandine. He specifically said that message_id could never possibly be a column in the schema (and he didn't say it explicitly, but his reasoning does extend to the other names, too). The reason is that those three things are not unique to each row, the first 'aas' column contains this information instead. > The fundamental issue is that you've got a namespace collision. This is a not > uncommon pattern, and there are various ways out of it. Nope, no namespace collision. Think of these as _required_ schema columns. They just couldn't be defined explicitly in the schema because 'aas' is an inherently vague data type. These three keywords are the names of the three strings that live in the 'aas' array-of-arrays. > First, you can try to choose argument names that *really* will never be used > as column names. An example might be names that start with an underscore > (that's probably a good enough rule to be sure to follow in libgwibber). They already are column names. They are required column names. 
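If we do add a memoizer, it's a one-liner (a sketch, assuming Python 3.2+ for functools.lru_cache, with regex_sub as defined above):

    from functools import lru_cache

    @lru_cache(maxsize=None)  # same input always yields the same output
    def strip(message):
        """Return a string suitable for fuzzy comparisons."""
        message = message.replace('http://', '')
        message = message.replace('https://', '')
        message = message.replace('gwibber:/', '')
        return regex_sub('', message)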
>> +publish_lock = Lock()
>
> Since this is private to the module, _publish_lock is probably a better name.

Ok.

>> +    def _publish(self, service, account_id, message_id, **kwargs):
>
> This means that you can never have columns called 'service', 'account_id', or
> 'message_id', right? Do you think that will ever be a problem in practice?

Nope, I discussed this with kenvandine. He specifically said that message_id could never possibly be a column in the schema (he didn't say so explicitly, but his reasoning extends to the other two names as well). The reason is that those three things are not unique to each row; the first 'aas' column contains this information instead.

> The fundamental issue is that you've got a namespace collision. This is a not
> uncommon pattern, and there are various ways out of it.

Nope, no namespace collision. Think of these as _required_ schema columns. They just couldn't be defined explicitly in the schema, because 'aas' is an inherently vague data type. These three keywords are the names of the three strings that live in the 'aas' array-of-arrays.

> First, you can try to choose argument names that *really* will never be used
> as column names. An example might be names that start with an underscore
> (that's probably a good enough rule to be sure to follow in libgwibber).

They already are column names. They are required column names. It is the same namespace :-P

>> +        message = COLUMN_NAMES.index('message')
>> +        sender = COLUMN_NAMES.index('sender')
>> +        ids = COLUMN_NAMES.index('message_ids')
>
> These are constants, right? Perhaps you can calculate them in model.py and
> just import them here? Perhaps with names that are a little more evocative of
> what they are, e.g. MESSAGE_COLUMN or somesuch? Or, see below for another
> suggestion.

Oh yeah, good point.

>> +        args = [''] * len(COLUMN_NAMES)
>> +        args[ids] = [[service, account_id, message_id]]
>
> Since this will be confusing to someone when they first read this, you might
> refer them to the comment in model.py about why this column is a nested
> array.

Ok.

>> +        # The basic idea is that you call _publish() with keyword
>> +        # arguments only, making the calling code a bit more explicit,
>> +        # and then this converts it from an unordered dict into the
>> +        # ordered list required by SharedModel.append().
>> +        for column_name, value in kwargs.items():
>> +            try:
>> +                args[COLUMN_NAMES.index(column_name)] = value
>
> .index() is actually not a cheap call to make, because it has to do a
> linear search of the sequence every time. Since this is all constant anyway,
> you could calculate one more object in model.py and import it here. Create a
> mapping from column name to index like so:
>
>     COLUMN_INDICES = {name: i for i, name in enumerate(COLUMN_NAMES)}
>
> Now, here, and in the code above, all you would need is:
>
>     bad_kwargs = []
>     for column_name, value in kwargs.items():
>         index = COLUMN_INDICES.get(column_name)
>         if index is None:
>             bad_kwargs.append(column_name)
>         else:
>             args[index] = value
>     if len(bad_kwargs) > 0:
>         raise TypeError('Unexpected keyword arguments: {}'.format(
>             COMMASPACE.join(bad_kwargs)))
>
> Why do I do it like that? So that if more than one bad keyword argument is
> given, you'll see them all the first time you hit this code.

Good call.

> I'm going to make one other suggestion about how to do all this, and I'll let
> you decide which you like.
>
>     args = []
>     for column_name, type_code in SCHEMA:
>         # This one is special because blah blah blah.
>         if column_name == 'message_ids':
>             args.append([[service, account_id, message_id]])
>         else:
>             # If not given, the column's value is an empty string.
>             args.append(kwargs.pop(column_name, ''))
>     # All keyword arguments must correspond to column names, so any
>     # leftover arguments are not allowed.
>     if len(kwargs) > 0:
>         raise TypeError('Unexpected keyword arguments: {}'.format(
>             COMMASPACE.join(kwargs)))
>
> {}.pop() is kind of nice, isn't it? :)

Yeah, dict.pop is fun, but overall I don't like this one. 'if' statements that only trigger once inside a loop really bother me, for performance reasons: you have to evaluate the test on every single iteration, and it's only ever true the first time.

>> +        with publish_lock:
>> +            # Don't let duplicate messages into the model, but do
>> +            # record the unique message ids of each duplicate message.
>> +            stripped = strip(args[sender] + args[message])
>> +            for row in SharedModel:
>> +                if stripped == strip(row[sender] + row[message]):
>> +                    row[ids] = row[ids] + args[ids]
>> +                    return True
>
> This code just confuses the hell out of me. I had to re-read it several
> times, and I'm still not sure why the concatenation occurs. Is that tested?

The concatenation joins the sender's name with his message for comparison. If you left the sender's name out, you would prevent two different people from being able to say the same thing. Say Alice tweets 'hi' and then Bob tweets 'hi': Bob's message would be lost because Alice said it first. By concatenating, you compare 'Alicehi' to 'Bobhi', see that they are different messages, and they each get their own rows in the model.
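In miniature, with the strip() from above (the asserts are mine):

    # The sender is part of the key, so identical texts from two
    # different people don't collide...
    assert strip('Alice' + 'hi') != strip('Bob' + 'hi')
    # ...while punctuation-only variants of one message still do.
    assert strip('Alice' + 'hi') == strip('Alice' + 'hi!!!')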
> Maybe one way to make it more intelligible would be to rename strip() to
> unique_key() and pass in the entire row. Then you might see something like:

Nope, you can't pass in the whole row, because then the service name becomes part of the string, and it's no longer possible to recognize that 'Alicehi' on Twitter is the same message as 'Alicehi' on Facebook: 'twitterAlicehi' is a different string than 'facebookAlicehi'. It turns out that concatenating *only* the sender's name and message text is the correct approach for fuzzy equivalence testing ;-)

>     new_key = unique_key(args)
>     for row in SharedModel:
>         if new_key == unique_key(row):
>             # There is a row collision. Discard the remaining data,
>             # but append the new message ids so that the front end can
>             # display all the sources of this data.
>             row[MESSAGE_IDS_INDEX].extend(args[MESSAGE_IDS_INDEX])
>
> Also, is there no more efficient way of asking whether the key already exists
> in the database than iterating through all the rows? Couldn't that get
> extremely inefficient?

That's a good question; I'm not sure. The problem is that Dee.SharedModel is kind of a custom gobjecty thing, it's not well documented, and I'm not sure how good the Python overrides are (it has some, but seemingly not all, of the methods you would expect in order to treat it as a true iterable). I was thinking we might just maintain a global set() instance containing the pre-stripped strings and simply test for membership there, but we would still need the ability to append to the message_ids 'aas' when we encounter a duplicate. I suppose what we really need is a global dict() instance, where the keys are our stripped strings and the values are DeeModelIters, so that we can get the DeeModelIter back easily and modify that row when we are adding a duplicate message.
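Something like this is what I have in mind (purely a sketch: _row_index and _append_message_ids are invented names, and the exact row-update call depends on what the Dee overrides actually offer):

    # Module-global index mapping each stripped key to the DeeModelIter
    # for its row, so duplicate detection becomes a dict lookup instead
    # of a linear scan over the whole model.
    _row_index = {}

    def _insert_or_merge(args):
        # sender/message/ids are the column indexes used in _publish().
        key = strip(args[sender] + args[message])
        row_iter = _row_index.get(key)
        if row_iter is None:
            # First sighting of this message: append it and remember it.
            _row_index[key] = SharedModel.append(*args)
        else:
            # Duplicate: record only the additional message ids on the
            # existing row (hypothetical helper).
            _append_message_ids(row_iter, args[ids])
        return True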
>> +        return bool(SharedModel.append(*args))
>
> What does SharedModel.append() actually return?

A DeeModelIter, which can be used to modify the row later, very much like a GtkListStore if you're familiar with those. I'm not actually aware of what might cause .append() to fail; it just seemed like the smart thing to be able to report a failure as False. Then again, it's not actually clear to me that any other code is going to check or care about the return value of Base._publish.

>> +# Most of this schema is very straightforward, but the 'service'
>
> I think you mean s/service/message_ids/ right?

Yep, changed the name and forgot to update the comment.

>> +#     ['facebook', '2/facebook', '12345'],
>> +#     ['twitter', '3/twitter', '987654'],
>> +#     ]
>
> Could you add an explanation of *why* the inner one is an array? I vaguely
> recall you and Ken discussing that this could be a hash of the concatenated
> values. A comment describing why this design was chosen would be helpful I
> think.

No, you're getting two different conversations confused. The 'aas' used to be an 'as', from years ago when 'aas' was buggy/broken in Python. The strings were concatenated with colons, so they looked like 'twitter:3/twitter:987654', and libgwibber then had to do a bunch of expensive string manipulations on them, stupidly. Ken and I agreed that 'aas' is actually the correct data type for the information being contained; the previous 'as' was just an ugly hack around a bug that was fixed long ago. The hashing thing Ken and I were talking about was in regard to the strip function above: Ken mentioned wanting the comparison done on hashes instead of on stripped strings, but then he showed me the existing Vala code (which doesn't do any hashing), and I didn't see any reason to add hashing to that algorithm, so I didn't bother.

>> +SCHEMA = (
>> +    ('message_ids', 'aas'),
>> +    ('stream', 's'),
>> +    ('transient', 's'),
>> +    ('sender', 's'),
>> +    ('sender_nick', 's'),
>> +    ('from_me', 'b'),
>> +    ('timestamp', 's'),
>> +    ('message', 's'),
>> +    ('html', 's'),
>> +    ('icon_uri', 's'),
>> +    ('url', 's'),
>> +    ('source', 's'),
>> +    ('timestring', 's'),
>> +    ('reply_nick', 's'),
>> +    ('reply_name', 's'),
>> +    ('reply_url', 's'),
>> +    ('likes', 'd'),
>> +    ('liked', 'b'),
>> +    ('retweet_nick', 's'),
>> +    ('retweet_name', 's'),
>> +    ('retweet_id', 's'),
>> +    ('link_picture', 's'),
>> +    ('link_name', 's'),
>> +    ('link_url', 's'),
>> +    ('link_desc', 's'),
>> +    ('link_caption', 's'),
>> +    ('link_icon', 's'),
>> +    ('img_url', 's'),
>> +    ('img_src', 's'),
>> +    ('img_thumb', 's'),
>> +    ('img_name', 's'),
>> +    ('video_pic', 's'),
>> +    ('video_src', 's'),
>> +    ('video_url', 's'),
>> +    ('video_name', 's'),
>> +    ('comments', 's'),
>> +    ('recipient', 's'),
>> +    ('recipient_nick', 's'),
>> +    ('recipient_icon', 's'),
>> +    )
>> +
>> +
>> +COLUMN_NAMES, COLUMN_TYPES = zip(*SCHEMA)
>> +
>> +
>> +SharedModel = Dee.SharedModel.new('com.Gwibber.Streams')
>> +SharedModel.set_schema_full(COLUMN_TYPES)
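And for the record, here's roughly what model.py's derived constants will look like once I apply your suggestions above (a sketch; the *_COLUMN names follow your MESSAGE_COLUMN proposal and aren't in the branch yet):

    COLUMN_NAMES, COLUMN_TYPES = zip(*SCHEMA)

    # O(1) column lookups, replacing the repeated COLUMN_NAMES.index() calls.
    COLUMN_INDICES = {name: i for i, name in enumerate(COLUMN_NAMES)}

    # More evocative names for the columns _publish() touches directly.
    MESSAGE_COLUMN = COLUMN_INDICES['message']
    SENDER_COLUMN = COLUMN_INDICES['sender']
    IDS_COLUMN = COLUMN_INDICES['message_ids']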