Lernid

Code review comment for lp:~coalwater/lernid/fix-810122

fix-810122
Merge into trunk

Revision history for this message

John S. Gruber (jsjgruber) wrote on 2012-04-24:

> Ok i could add those missing characters, ( just enough to make those examples
> you posted work ) just keep in mind this won't be perfect till people start to
> use it and other cases start to appear, here's a list of the characters that
> need to be added ( the ones between the [ ] ).
> [ % ; , + # ]
> I will try update the branch with this and the other bug about the [ ] for
> ignoring URL's soon hopefully

This doesn't seem like a good idea to me. What we have now has been working for a long time, and
the only problem at hand seems to be the trailing "," . We could be trading a fix for that for
a bunch more problems -- not a good trade. We should do our best not to introduce any regressions.

Seems to me that there are two possible ways to go:

1) Update the regex to include all of the characters possible in the URL. The bnf in
http://tools.ietf.org/html/rfc3986 gives the lists. (bnf makes my head hurt, but
that's a safer way to do it. Let me know if you want to do it this way and want my help
interpretting the RFC.

2) We could also steal an approach from somewhere else that is doing it well. (Launchpad, for example).
http://bazaar.launchpad.net/~launchpad-pqm/launchpad/devel/view/head:/lib/lp/app/browser/stringformatter.py
but that's pretty sophisticated. They look for the URL, match up parentheses, and remove trailing punctuation.
I didn't look at it long enough to really figure it out. Seems to me this might take quite a bit of our time.

Either 1) or 2) could open regressions, however.

3) Safe and effective, but not as comprehensive as 1) and 2), would be to simply add "," to the line
in cleanurl

if any([url.endswith(c) for c in '.;:]']):

We could then also add clean url to Ircbackend,
which would then fix the problem of trailing . ; : etc. in the links it marks. I didn't realize until now
that that problem existed. http://www.google.com; now has the ";" marked as part of the url, though it opens
correctly through the code in Browser.py it won't click properly. Thats problem has been there for ages, with no one ever noticing it, as far as I know.

I wish I had tried more URL's through your regex much sooner. The things I just tried were only links that occured to me when I realized that your regex was sensitive to special characters in a URL. I'm sure there are many more
that would produce errors.

I just pushed again to lernid-proposed. I think it's getting closer to being ready to release.

Good luck and please let me know if you'd like some help.