On Sep 18, 2012, at 11:48 PM, Robert Bruce Park wrote:

>This is actually an old diffstat! The final total was actually 1550
>lines deleted ;-)

Happy happy, joy joy!

>> I don't think you need the context manager version of this, since you're not
>> checking anything about the raised exception. You could possibly check
>> details about the exception, e.g. the bad column name (more on this later),
>> but if not, then I think this would be fine:
>>
>>     self.assertRaises(TypeError, base._publish, invalid_argument='not good')
>
>Fair. I had forgotten about your proposed syntax and was thinking that
>the with statement was the only way to test for exceptions. You can
>change this if you feel strongly about it, but I'm indifferent on this
>point.

Right, the context manager is really useful if you want to make additional assertions about the exception that was raised. For example, let's say you raised some Gwibber-specific exception, and you stored the set of invalid arguments on an attribute of that exception. You could add an assertion that the attribute had the set of strings you expected it to have, and you'd have to use the context manager syntax to do that. I'm not suggesting this as a change, just illustrating an example (which I know you understand, just for anyone else following along).

>> One other thing we discussed on IRC is that these tests are not well
>> isolated from the environment. Because these are not running inside
>> TestDBus, they actually pollute the user's session. Other than the problem
>> that running the test would show bogus data in the Gwibber UI, the other
>> problem is that the model doesn't get reset after every test, so I bet it
>> persists between test runs. Which means it's not clear this is actually
>> testing what you think it's testing.
>
>Yes, I know.
>It's not clear to me exactly how this will fit into
>TestDBus though, because SharedModel is not like our other DBus classes
>where we are defining the classes ourselves and have a lot of control
>over them. I have no idea if SharedModel will respect the custom
>sessionbus that TestDBus creates, or how to ensure that that is
>happening. I was hoping you'd offer a bit more guidance in this aspect.

I don't know a lot about SharedModel yet, but Ken mentioned on IRC that using a temporary session bus will isolate the model changes. This should be fairly easy to verify, though the minimal documentation I've found on Dee doesn't really say. I'd be more worried about any caching polluting the user's environment. We'll just have to hack on this to make sure.

>Also, on some level we need to understand that we don't need to test
>SharedModel itself, but just the code that we use to interface with it.
>To that end, we might almost be better off leaving it outside of
>TestDBus, and simply create a new SharedModel that uses a different bus
>name, then mock that in place of the 'real' SharedModel. At least as far
>as testing Base._publish goes, anyway.

It probably does make sense to have two types of tests: one outside dbus, which mocks the SharedModel, and one inside dbus, which uses the real shared model on a temporary bus (which TestDBus already sets up and manages). The former tests can be much more extensive than the latter, but there should be a few really-over-dbus tests of _publish() just to ensure that updates to the model get expressed in peers of the SharedModel. It would be more work, but it might also be possible to write the tests once and run them both inside and outside dbus. Probably not worth it though.

>I'll play with this some more tonight I suppose.

Cool.

>> The solution is to move all the _publish() tests to TestDBus, since that
>> ensures it's running on a temporary, isolated session bus, and thus as Ken
>> says, it should have its own SharedModel.
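Going back to the context-manager point near the top, here's a minimal runnable sketch of the two styles side by side. Note that GwibberError and its bad_arguments attribute are invented purely for illustration; the real _publish() under discussion raises a plain TypeError:

```python
import unittest

class GwibberError(Exception):
    """Hypothetical Gwibber-specific exception, invented for this example."""
    def __init__(self, bad_arguments):
        super().__init__('Unexpected keyword arguments')
        self.bad_arguments = frozenset(bad_arguments)

def publish(**kwargs):
    # Stand-in for base._publish(); rejects every keyword for this demo.
    raise GwibberError(kwargs)

class TestPublishErrors(unittest.TestCase):
    def test_one_liner(self):
        # Enough when you only care that the exception is raised.
        self.assertRaises(GwibberError, publish, invalid_argument='not good')

    def test_context_manager(self):
        # The context-manager form lets you inspect the exception afterward.
        with self.assertRaises(GwibberError) as cm:
            publish(invalid_argument='not good')
        self.assertEqual(cm.exception.bad_arguments, {'invalid_argument'})
```

The one-liner is terser; the context manager buys you the `cm.exception` handle for follow-up assertions.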
>
>How does TestDBus set up this custom sessionbus?

It does this by running gwibber-service with a specially crafted environment, set up by the dbus-daemon executable. If you look in gwibber/testing/dbus-session.conf.in, you'll see a template for a dbus session bus configuration file. The trick here is that this is telling the session bus to listen on a unix socket in a temporary directory. The test suite fills in this tempdir on the fly and then executes dbus-daemon, pointing it to this config file.

It also puts the com.Gwibber.*.service files in this directory so that the isolated dbus-daemon will know what executable to run when one of those services is requested. Those all point to the gwibber-service executable in the virtualenv's bin dir, which ensures that we're testing against the service we're developing and not the one installed on the system; it also explains why the virtualenv must be set up in order for the dbus tests not to be skipped. The last piece of the puzzle is that bin/gwibber-service no longer actually exists; it's created by running `python3 setup.py install` since it's defined as a console_script entry point.

This also explains why we don't currently test any signals: the foreground process, which is running the Python test suite, does *not* connect to the temporary session bus, it connects to the actual user session bus. So we would have to either re-exec the test suite under the dbus-daemon regime described above, or more likely, write a small test script that also runs under the temporary session dbus to react to those signals. It's all rather more complication than I've had time for, but I think it's important to do at some point, because those signals are untested, and untested code is broken code. :)

>Is it done in such a way that importing SharedModel will force it to use the
>custom bus and not the real bus?

Only in the gwibber-service executed by dbus-daemon described above.
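For readers who don't have the tree handy, the kind of template described above looks roughly like this. This is a sketch of a dbus session bus config, not the literal contents of gwibber/testing/dbus-session.conf.in; the @TMPDIR@ and @SERVICE_DIR@ placeholders are invented names for whatever substitution markers the real template uses:

```xml
<!-- Sketch of a dbus-session.conf.in-style template.  The test suite
     substitutes the temporary directory paths before launching
     dbus-daemon with - -config-file pointed at the filled-in result. -->
<busconfig>
  <type>session</type>
  <!-- Listen on a unix socket inside the test's temporary directory,
       instead of the user's real session bus socket. -->
  <listen>unix:tmpdir=@TMPDIR@</listen>
  <!-- Directory holding the com.Gwibber.*.service activation files,
       which point at the virtualenv's gwibber-service executable. -->
  <servicedir>@SERVICE_DIR@</servicedir>
  <policy context="default">
    <allow send_destination="*"/>
    <allow own="*"/>
  </policy>
</busconfig>
```

Running `dbus-daemon --config-file=<filled-in conf> --print-address` then yields the isolated bus address the tests can export to child processes.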
>Because SharedModel handles its own dbus logic internally, I don't see any
>way to tell SharedModel 'hey, use this other bus instead' unless your
>TestDBus magic is somehow affecting the dbus module itself in a
>process-global way.

Yep, it is. `M-x man RET dbus-daemon RET` will provide additional information. Essentially, it's dbus-daemon that's doing all the magic the dbus library needs to run gwibber-service on an isolated session bus.

>> If TestDBus gets too unwieldy as a result, we'll have to find a way to run
>> multiple TestCases under the temporary session bus. (It should be doable,
>> but maybe not so easily.)
>
>TestDBus was decimated by the code removals, so a small addition should
>still result in a net decrease in size. I wouldn't worry about that so
>much at this point.

Excellent.

>>> +regex_sub = re.compile('[{}\s]'.format(re.escape(string.punctuation))).sub
>>
>> I think you just impaled some kitten on a rainbow unicorn's horn.
>
>Oh Barry! Regex is not our enemy!

For a fun bit of internet lore: http://regex.info/blog/2006-09-15/247

>> I have no idea what the purpose of this is. What is this regexp trying to
>> do?
>
>It defines a basic matcher that says 'match all punctuation, and
>whitespace too'.

Right, I get that. What I'm trying to point out is that it may be difficult to figure out for another person who is not steeped as deeply in this code as we are.

A use case: Ubuntu bugs are often fixed by people who are very smart but don't have deep application domain knowledge of the code they are trying to fix. The more we make them think and reason, the more difficult it will be for them to fix a bug that is affecting our users. This isn't to say that clever or convenient or cryptic code should never be used, but IMHO we should always keep this drive-by bug-fixer in mind. I've been that person many times, so I know how painful it can be when you're faced with a critical bug in some deep code.
If we can help this future Ubuntu bug fixer figure out the code in half the time, then maybe she can fix two bugs in a day rather than just one. So when I'm writing or reviewing code, I'm always trying to think about this person and to anticipate where they will get bogged down. Sometimes their life can be made easier by rewriting the code in a more obvious construct, but of course there are other trade-offs, so sometimes a 5-line comment explaining both the what and the why of it will do the trick. It's difficult for everybody intimately familiar with the code, algorithms, off-line discussions, whiteboard diagrams, UDS sessions, etc. to step back and adopt the beginner's mind. But I do think it's a mindset worth cultivating.

>I was hoping the tests would do a better job of demonstrating what this
>was for. Did you notice that the second 'message' that was rejected as a
>duplicate actually had different punctuation than the first one that was
>accepted?

Actually, no, I missed that detail. This is one reason why I prefer really focused tests. A test called test_message_with_different_punctuation_is_a_duplicate() - or some such reasonable name - would make that much more obvious.

>This regex is used in the following function to strip out all punctuation and
>spaces from a string.

A wonderful comment!

>> Why is it needed? My guess is that you want to remove some punctuation from
>> something, but I have no idea why you'd want to do that. Plus, one-liners
>> like this are always head-scratchers: "um, oh wait! you're inserting some
>> escaped characters into a regexp character set with some trailing
>> whitespace".

Please note that this part of the review is me with my drive-by Ubuntu bug fixer hat on. I'm sorry, I'll try to make that more obvious in the future. Regexps aren't *necessarily* evil, despite the above link.
:)

>>> +
>>> +def strip(message):
>>> +    """Return a string suitable for fuzzy comparisons."""
>>> +    message = message.replace('http://', '')
>>> +    message = message.replace('https://', '')
>>> +    message = message.replace('gwibber:/', '')
>>> +    return regex_sub('', message)
>>
>> This adds to the confusion. I don't know why "fuzzy comparisons" are
>> important here, or why the transformations help make the message argument
>> more suitable for them.
>
>Hey man, this is still better than the code that it replaced. The
>'message.replace' lines I don't fully understand, but they were insisted
>upon by kenvandine, because that's what libgwibber is currently doing.

Right, we just need to make that evident in order to give the drive-byer a clue as to what's going on, and where possibly to look for future reference.

>The idea is that we don't want duplicate messages to show up in the
>stream of messages that are visible to the user. But different social
>media sites attach different semantic meanings to different punctuation
>marks, so we want to ignore those for the sake of determining whether
>one message is actually identical to another or not. So we need to strip
>them out for the sake of comparing the strings.

Beautiful! I think I'll just cut and paste this right into a comment. :)

>Consider this example:
>
>Fred uses Gwibber to post identical messages on Twitter and Google+
>(pretend that we support G+ for a moment). Fred writes "Hey jimbob, been
>to http://example.com lately?", and this message might show up on
>Twitter like "Hey @jimbob, been to example.com lately?", but it might
>show up on G+ like "Hey +jimbob, been to http://example.com lately?". So
>we need to strip out all the possibly-different bits in order to
>identify that these messages are exactly the same. In both those
>cases, the string is converted into "Heyjimbobbeentoexamplecomlately"
>and then they compare equally, so we've identified a duplicate message.
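Fred's example is easy to check directly. Here's a self-contained version of the quoted helper (same names as the diff above, with the scheme-stripping that mirrors what libgwibber does) run against both renderings of his post:

```python
import re
import string

# Mirrors the quoted helper: a matcher for "all punctuation, and
# whitespace too", with the bound sub() method saved for speed.
regex_sub = re.compile('[{}\\s]'.format(re.escape(string.punctuation))).sub

def strip(message):
    """Return a string suitable for fuzzy duplicate comparisons."""
    # The scheme-stripping mirrors what libgwibber currently does.
    message = message.replace('http://', '')
    message = message.replace('https://', '')
    message = message.replace('gwibber:/', '')
    return regex_sub('', message)

# Fred's one post, as rendered by two different services:
twitter = 'Hey @jimbob, been to example.com lately?'
gplus = 'Hey +jimbob, been to http://example.com lately?'
# Both collapse to 'Heyjimbobbeentoexamplecomlately', so they compare
# equal and the second one is correctly flagged as a duplicate.
```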
>
>> Imagine yourself coming on this code cold; would you understand its
>> purpose? Pretend you're reverse engineering it all; what explanation would
>> you provide about what's going on that you wish were there to save you a
>> day of confusion? :)
>
>Ok, so it needs some better comments explaining it, but it had to be
>done with regex for performance reasons. Using a list comprehension to
>build a string was shown to be 3x slower[1], and this function gets
>executed once for each message in the stream (so it is O(n)). Just wait
>until there are a lot of messages present, and then you'll be sorry ;-)

:)

>[1] http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python/266162#266162

Cool. As I did a poor job of saying originally, it's not so much the regexp that I'm concerned with; it's the comprehension of both the what and the why by the future drive-byer that I want to improve.

>> I think the purpose of all this is to provide you with a unique key that you
>> could use to detect duplicate rows. Am I right? If so, then maybe also
>> rename strip() to something more related to its purpose, rather than its
>> implementation.
>
>You got it exactly right! It's not so hard to read :-P
>
>If you feel it's really that hard to read, we can change it to the list
>comprehension idiom instead (see 'test_set' in the above link), but not
>without doing some kind of cache/memoizer to stop it having to run so
>many times. Actually strip() is a good candidate for memoization because
>it is called repeatedly with the same input, doesn't depend on any
>outside environment other than what is passed to it, and really does
>need to be fast (otherwise there will be big delays trying to add new
>messages to the stream).

We can keep the regexp.

>>> +    def _publish(self, service, account_id, message_id, **kwargs):
>>
>> This means that you can never have columns called 'service', 'account_id',
>> or 'message_id', right?
>> Do you think that will ever be a problem in
>> practice?
>
>Nope, I discussed this with kenvandine. He specifically said that
>message_id could never possibly be a column in the schema (and he didn't
>say it explicitly, but his reasoning does extend to the other names,
>too). The reason is that those three things are not unique to each row;
>the first 'aas' column contains this information instead.

Okay. If Ken says that's the case, then it's safe. The important thing to understand here is that you cannot call this method with the explicit keyword arguments given both positionally and in the **kwargs; Python will raise a TypeError in that case.

>> The fundamental issue is that you've got a namespace collision. This is a
>> not uncommon pattern, and there are various ways out of it.
>
>Nope, no namespace collision. Think of these as _required_ schema
>columns. They just couldn't be defined explicitly in the schema because
>'aas' is an inherently vague data type. These three keywords are the
>names of the three strings that live in the 'aas' array-of-arrays.
>
>> First, you can try to choose argument names that *really* will never be used
>> as column names. An example might be names that start with an underscore
>> (that's probably a good enough rule to be sure to follow in libgwibber).
>
>They already are column names. They are required column names. It is the
>same namespace :-P

Sounds great.

>>     bad_kwargs = []
>>     for column_name, value in kwargs.items():
>>         index = COLUMN_INDICES.get(column_name)
>>         if index is None:
>>             bad_kwargs.append(column_name)
>>         else:
>>             args[index] = value
>>     if len(bad_kwargs) > 0:
>>         raise TypeError('Unexpected keyword arguments: {}'.format(
>>             COMMASPACE.join(bad_kwargs)))
>>
>> Why do I do it like that? So that if more than one bad keyword argument is
>> given, you'll see them all the first time you hit this code.
>
>Good call.
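For completeness, here's a self-contained, runnable version of that collect-all-the-bad-keywords idiom. COLUMN_INDICES and its three column names are invented stand-ins for the real schema mapping:

```python
COMMASPACE = ', '

# Invented stand-in for the real schema's column-name -> index mapping.
COLUMN_INDICES = {'sender': 0, 'message': 1, 'timestamp': 2}

def build_args(**kwargs):
    """Map keyword arguments onto column positions.

    Unknown keywords are collected and reported all at once, so the
    caller sees every mistake the first time through, not one per run.
    """
    args = [''] * len(COLUMN_INDICES)
    bad_kwargs = []
    for column_name, value in kwargs.items():
        index = COLUMN_INDICES.get(column_name)
        if index is None:
            bad_kwargs.append(column_name)
        else:
            args[index] = value
    if len(bad_kwargs) > 0:
        raise TypeError('Unexpected keyword arguments: {}'.format(
            COMMASPACE.join(sorted(bad_kwargs))))
    return args
```

Calling `build_args(bogus='x', wat='y')` raises one TypeError naming both bad keywords.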
>
>> I'm going to make one other suggestion about how to do all this, and I'll
>> let you decide which you like.
>>
>>     args = []
>>     for column_name, type_code in SCHEMA:
>>         # This one is special because blah blah blah.
>>         if column_name == 'message_ids':
>>             args.append([[service, account_id, message_id]])
>>         else:
>>             # If not given, the column's value is an empty string.
>>             args.append(kwargs.pop(column_name, ''))
>>     # All keyword arguments must correspond to column names, so any
>>     # leftover arguments are not allowed.
>>     if len(kwargs) > 0:
>>         raise TypeError('Unexpected keyword arguments: {}'.format(
>>             COMMASPACE.join(kwargs)))
>>
>> {}.pop() is kind of nice, isn't it? :)
>
>Yeah, dict.pop is fun, but overall I don't like this one. 'if'
>statements that only trigger once inside a loop really bother me, for
>performance reasons. You have to evaluate that test every single time,
>and it's only ever true the first time.

It doesn't bother me for performance reasons, but aesthetically it *is* ugly. I think that's relatively easy to fix:

    # Yes, three nested lists because blah blah blah.
    args = [[[service, account_id, message_id]]]
    # Skip message_ids because blah blah blah.
    for column_name, type_code in SCHEMA[1:]:
        # If not given, the column's value is an empty string.
        args.append(kwargs.pop(column_name, ''))
    # All keyword arguments must correspond to column names, so any
    # leftover arguments are not allowed.
    if len(kwargs) > 0:
        raise TypeError('Unexpected keyword arguments: {}'.format(
            COMMASPACE.join(kwargs)))

Fine with me either way though.

>>> +        with publish_lock:
>>> +            # Don't let duplicate messages into the model, but do
>>> +            # record the unique message ids of each duplicate message.
>>> +            stripped = strip(args[sender] + args[message])
>>> +            for row in SharedModel:
>>> +                if stripped == strip(row[sender] + row[message]):
>>> +                    row[ids] = row[ids] + args[ids]
>>> +                    return True
>>
>> This code just confuses the hell out of me.
>> I had to re-read it several
>> times, and I'm still not sure why the concatenation occurs. Is that tested?
>
>The concatenation joins together the sender's name with his message for
>comparison. If you leave out the sender's name, then you prevent two
>different people from being able to say the same thing. Like Alice
>tweets 'hi' and then Bob tweets 'hi' and Bob's message is lost because
>Alice said it first. By concatenating, you then are comparing 'Alicehi'
>to 'Bobhi' and you see that they are different messages, so they get
>their own rows in the model.

Another excellent explanation I might just cut-n-paste into the code. :)

>> Maybe one way to make it more intelligible would be to rename strip() to
>> unique_key() and pass in the entire row. Then you might see something like:
>
>Nope, you can't pass in the whole row because then the service name
>becomes part of the string, and it no longer is able to realize that
>'Alicehi' on twitter is the same message as 'Alicehi' on facebook,
>because 'twitterAlicehi' is a different string than 'facebookAlicehi'.
>
>It turns out that concatenating the sender's name and message text
>*only* is the correct approach for fuzzy equivalence testing ;-)

No, because it's just that strip() - or whatever we call it - would pull out row[sender] and row[message] and do the concatenation itself. It just moves that logic into the method that calculates the unique key from the row data.

>> Also, is there no more efficient way of asking whether the key already
>> exists in the database than iterating through all the rows? Couldn't that
>> get extremely inefficient?
>
>That's a good question, I'm not sure. The problem is that
>Dee.SharedModel is kind of a custom gobjecty thing, and it's not well
>documented, and I'm not sure how good the python overrides are (it has
>some but seemingly not all of the methods you would expect to be able to
>treat it as a true iterable).
The best I could find is this:

http://developer.ubuntu.com/api/ubuntu-12.04/python/Dee-1.0.html

SharedModel implements this:

http://developer.ubuntu.com/api/ubuntu-12.04/python/Dee-1.0.html#Dee.Model

and yeah, the documentation is, um, sparse. So it's probably worth playing around with this to see if there's a better way to find the row. find_row_sorted() looks promising, as does Dee.Index.lookup(). You gotta figure there's an efficient way to query the table.

>I was hoping that we might just maintain a global set() instance that
>can contain pre-stripped strings, and then we can just test for
>membership there, but then we still need the ability to append to the
>message_ids 'aas' in the event that we are dealing with a duplicate. I
>suppose what we really need then is a global dict() instance, where the
>keys are our stripped strings, and the values are DeeModelIters, such
>that we can get the DeeModelIter back easily and then modify that row
>when we are adding a duplicate message.
>
>>> +        return bool(SharedModel.append(*args))
>>
>> What does SharedModel.append() actually return?
>
>A DeeModelIter, which can be used to modify the row later. Very much
>like a GtkListStore if you're familiar with those. I'm not actually
>aware of what might cause .append to fail, but it just seemed like the
>smart thing to do to be able to report a failure as False, but then
>again it's not actually clear to me that any other code is going to
>check or care about the return value from Base._publish.

I guess my question is: what is the truth value of a DeeModelIter? There's no documentation on that, or on whether .append() can return None or something else if it fails. Either we'll have to do some experiments to find out what it does in practice, or maybe just kill the return value.

>>> +# Most of this schema is very straightforward, but the 'service'
>>
>> I think you mean s/service/message_ids/ right?
>
>Yep, changed the name and forgot to update the comment.
>
>>> +#     ['facebook', '2/facebook', '12345'],
>>> +#     ['twitter', '3/twitter', '987654'],
>>> +#     ]
>>
>> Could you add an explanation of *why* the inner one is an array? I vaguely
>> recall you and Ken discussing that this could be a hash of the concatenated
>> values. A comment describing why this design was chosen would be helpful I
>> think.
>
>No, you're getting two different conversations confused. The 'aas' thing
>used to be an 'as' from years ago, when 'aas' was buggy/broken in
>python. The string was concatenated with colons, and thus looked like
>'twitter:3/twitter:987654', and then libgwibber had to do a bunch of
>expensive string manipulations on it, stupidly.
>
>Ken and I agreed that 'aas' is actually the correct data type for the
>information being contained; the previous 'as' was just an ugly hack to
>work around a bug that was fixed long ago.

Excellent.

>The hashing thing that Ken and I were talking about was in regards to
>the strip function above. Ken mentioned wanting the comparison done on
>hashes instead of on stripped strings, but then he showed me the
>existing vala code (which doesn't do any hashing) and I didn't see any
>reason to add any hashing to that algorithm, so I didn't bother.

Cool. It serves as an opaque string either way.
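If hashing ever did become attractive (say, to bound the key length for very long messages), it would be a one-liner on top of the stripped string. Nothing in the current code does this; it's just a sketch of the alternative Ken floated:

```python
import hashlib

def hashed_key(stripped):
    """Fixed-length, still-opaque digest of a stripped comparison string.

    Identical stripped strings hash identically, so duplicate detection
    behaves exactly as it does when comparing the strings directly; the
    only cost is one extra hash per message.
    """
    return hashlib.sha1(stripped.encode('utf-8')).hexdigest()
```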