Merge lp:~max-rabkin/ibid/url-translate into lp:~ibid-core/ibid/old-trunk-1.6

Proposed by Max Rabkin on 2010-02-24
Status: Merged
Approved by: Jonathan Hitchcock on 2010-03-01
Approved revision: 898
Merged at revision: 905
Proposed branch: lp:~max-rabkin/ibid/url-translate
Merge into: lp:~ibid-core/ibid/old-trunk-1.6
Diff against target: 138 lines (+48/-16)
3 files modified
ibid/plugins/languages.py (+15/-2)
ibid/plugins/urlgrab.py (+3/-14)
ibid/utils/__init__.py (+30/-0)
To merge this branch: bzr merge lp:~max-rabkin/ibid/url-translate
Reviewer            Review Type  Date Requested  Status
Jonathan Hitchcock                               Approve on 2010-03-01
Stefano Rivera                   2010-02-24      Approve on 2010-02-27
Michael Gorven                                   Approve on 2010-02-24
Review via email: mp+20031@code.launchpad.net
Max Rabkin (max-rabkin) wrote :

I've refactored the URL detection out of the Grab processor, so hopefully it works reasonably well.

Stefano Rivera (stefanor) wrote :

> I've refactored the URL detection out of the Grab processor, so hopefully it
> works reasonably well.

Nice!

+ event.addresponse(u'http://translate.google.com/'
+                   u'translate?sl=%(src_lang)s'
+                   u'&tl=%(dest_lang)s&u=%(url)s',
+                   {'src_lang': src_lang,
+                    'dest_lang': dest_lang,
+                    'url': text})

That's a bit ick. urlencode?
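
A minimal sketch of what the urlencode suggestion amounts to. It is written for Python 3 (`urllib.parse.urlencode`), whereas the branch targets Python 2 (`urllib.urlencode`). The helper name `translate_link` is hypothetical; the `sl`, `tl` and `u` parameter names are taken from the quoted snippet.

```python
from urllib.parse import urlencode


def translate_link(src_lang, dest_lang, url):
    """Build a Google Translate page link with escaped query parameters.

    Hypothetical helper for illustration; not part of ibid.
    """
    # A list of pairs keeps the parameter order stable and readable.
    query = [('sl', src_lang), ('tl', dest_lang), ('u', url)]
    return 'http://translate.google.com/translate?' + urlencode(query)


print(translate_link('de', 'en', 'http://example.com/?a=1&b=2'))
```

With plain `%` interpolation, the `&b=2` in the target URL would be read as a separate parameter of the translate request; `urlencode` escapes it as part of the `u` value.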

ibid/plugins/urlgrab.py:16: 'locate_resource' imported but unused
ibid/utils/__init__.py:211: undefined name 'log'
(also, trailing white-space)

Functionality seems fine, though.

review: Needs Fixing
Stefano Rivera (stefanor) :
review: Approve
Michael Gorven (mgorven) wrote :

 review approve
 status approved

review: Approve
Stefano Rivera (stefanor) wrote :

Doesn't work with IDNs

Stefano Rivera (stefanor) wrote :

my test cases: http://παράδειγμα.δοκιμή/
and http://παράδειγμα.δοκιμή/Αρχική_σελίδα

Stefano Rivera (stefanor) wrote :

max: looks like google doesn't support IDNs in google translate, but let's not throw an exception:

Query: translate http://παράδειγμα.δοκιμή/Αρχική_σελίδα to english
ERROR:scripts.ibid-plugin:Exception occured in Translate processor of languages plugin
Traceback (most recent call last):
  File "scripts/ibid-plugin", line 129, in <module>
    processor.process(event)
  File "./ibid/plugins/__init__.py", line 119, in process
    method(event, *match.groups())
  File "./ibid/plugins/languages.py", line 190, in translate
    urlencode(query))
  File "/usr/lib/python2.5/urllib.py", line 1250, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-16: ordinal not in range(128)
WARNING:plugins.unicode:Found a non-unicode string: exception
Response: I'm not feeling too well

review: Needs Fixing
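
The traceback shows Python 2's `urlencode` calling `str()` on a unicode value, which implicitly encodes it as ASCII; positions 7-16 are the ten Greek characters of the host name that follow `http://`. The sketch below reproduces the same failure explicitly in Python 3 and shows that handing `urlencode` UTF-8 bytes avoids it. It illustrates the failure mode only and is not the fix that was merged.

```python
from urllib.parse import urlencode

url = 'http://παράδειγμα.δοκιμή/Αρχική_σελίδα'

# Python 2's urlencode() effectively tried an ASCII encode of the value.
try:
    url.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# UTF-8 bytes can always be percent-quoted, so no exception is raised.
encoded = urlencode({'u': url.encode('utf-8')})

print(failed)
print(encoded)
```
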
Jonathan Hitchcock (vhata) wrote :

Also, for consistency, please use a \ for multi-line imports (see factoids, seen, memo, ibid.db, etc).

That is all.

review: Needs Fixing
Max Rabkin (max-rabkin) wrote :

Is there a function in the standard libraries to convert IDNs and IRIs to URIs?

> Is there a function in the standard libraries to convert IDNs and IRIs to
> URIs?

Check out http://jehiah.cz/archive/handling-idn-in-python

Michael Gorven (mgorven) wrote :

On Friday 26 February 2010 11:11:06 Max Rabkin wrote:
> Is there a function in the standard libraries to convert IDNs and IRIs to
> URIs?

ibid.utils.url_to_bytestring()
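
For readers without the ibid source to hand, here is a rough sketch of one way an IRI can be turned into an ASCII-only URI in Python 3: IDNA-encode the host and percent-encode the UTF-8 bytes of the rest. The function name `iri_to_uri` is hypothetical. Only the last two lines of `url_to_bytestring()` are visible in the diff context, so this should not be read as its actual implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert an IRI to an ASCII-only URI (hypothetical helper).

    Assumption: userinfo (user:password@) is not needed and is dropped.
    """
    parts = urlsplit(iri)
    # Host: the built-in 'idna' codec Punycode-encodes each label.
    host = parts.hostname.encode('idna').decode('ascii') if parts.hostname else ''
    if parts.port is not None:
        host = '%s:%d' % (host, parts.port)
    # Everything else: percent-encode the UTF-8 bytes. '%' is kept safe so
    # input that is already escaped is not escaped a second time.
    path = quote(parts.path, safe='/%')
    query = quote(parts.query, safe='=&%+')
    fragment = quote(parts.fragment, safe='%')
    return urlunsplit((parts.scheme, host, path, query, fragment))


print(iri_to_uri('http://παράδειγμα.δοκιμή/Αρχική_σελίδα'))
```

The result contains only ASCII characters, so it can be passed to `urlencode` on Python 2 without triggering the `UnicodeEncodeError` reported above.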

lp:~max-rabkin/ibid/url-translate updated on 2010-02-27
898. By Max Rabkin on 2010-02-27

handle IRIs in google translate

Max Rabkin (max-rabkin) wrote :

OK, both complaints should be fixed.

Stefano Rivera (stefanor) :
review: Approve
Jonathan Hitchcock (vhata) :
review: Approve

Preview Diff

=== modified file 'ibid/plugins/languages.py'
--- ibid/plugins/languages.py 2010-01-18 23:20:33 +0000
+++ ibid/plugins/languages.py 2010-02-27 00:15:26 +0000
@@ -3,12 +3,15 @@

 from random import choice
 import re
+from urllib import urlencode
+from urlparse import urlparse

 from dictclient import Connection

 from ibid.plugins import Processor, match
 from ibid.config import Option, IntOption
-from ibid.utils import decode_htmlentities, json_webservice, human_join
+from ibid.utils import decode_htmlentities, json_webservice, human_join, \
+                       is_url, url_to_bytestring

 help = {}

@@ -129,7 +132,7 @@

 help['translate'] = u'''Translates a phrase using Google Translate.'''
 class Translate(Processor):
-    u"""translate <phrase> [from <language>] [to <language>]
+    u"""translate (<phrase>|<url>) [from <language>] [to <language>]
     translation chain <phrase> [from <language>] [to <language>]"""

     feature = 'translate'
@@ -180,6 +183,16 @@
         dest_lang = self.language_code(dest_lang or self.dest_lang)
         src_lang = self.language_code(src_lang or '')

+        if is_url(text):
+            if urlparse(text).scheme in ('', 'http'):
+                url = url_to_bytestring(text)
+                query = {'sl': src_lang, 'tl': dest_lang, 'u': url}
+                event.addresponse(u'http://translate.google.com/translate?' +
+                                  urlencode(query))
+            else:
+                event.addresponse(u'I can only translate HTTP pages')
+            return
+
         try:
             translated = self._translate(event, text, src_lang, dest_lang)[0]
             event.addresponse(translated)

=== modified file 'ibid/plugins/urlgrab.py'
--- ibid/plugins/urlgrab.py 2010-02-06 10:01:01 +0000
+++ ibid/plugins/urlgrab.py 2010-02-27 00:15:26 +0000
@@ -13,7 +13,7 @@
 from ibid.config import Option
 from ibid.db import IbidUnicode, IbidUnicodeText, Integer, DateTime, \
                     Table, Column, ForeignKey, Base, VersionedSchema
-from ibid.utils import locate_resource
+from ibid.utils import url_regex
 from ibid.utils.html import get_html_parse_tree

 help = {}
@@ -60,22 +60,11 @@
                     'delicious')

     def setup(self):
-        tldfile = locate_resource('ibid', 'data/tlds-alpha-by-domain.txt')
-        if tldfile:
-            f = file(tldfile, 'r')
-            tlds = [tld.strip().lower() for tld in f.readlines()
-                    if not tld.startswith('#')]
-            f.close()
-        else:
-            log.warning(u"Couldn't open TLD list, falling back to minimal default")
-            tlds = 'com.org.net.za'.split('.')
-
         self.grab.im_func.pattern = re.compile((
             r'(?:[^@./]\b(?!\.)|\A)(' # Match a boundary, but not on an e-mail address
-            r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www.
-            r'|[^@\s:/]+\.(?:%s)(?:/\S*?)?' # Guess at the URL based on TLD
+            + url_regex() +
             r')[\[>)\]"\'.,;:]*(?:\s|\Z)' # End boundary
-            ) % '|'.join(tlds), re.I | re.DOTALL)
+            ), re.I | re.DOTALL)

     @handler
     def grab(self, event, url):

=== modified file 'ibid/utils/__init__.py'
--- ibid/utils/__init__.py 2010-02-20 15:23:24 +0000
+++ ibid/utils/__init__.py 2010-02-27 00:15:27 +0000
@@ -3,6 +3,7 @@

 from gzip import GzipFile
 from htmlentitydefs import name2codepoint
+import logging
 import os
 import os.path
 import re
@@ -23,6 +24,8 @@
 import ibid
 from ibid.compat import defaultdict, json

+log = logging.getLogger('utils')
+
 def ago(delta, units=None):
     parts = []

@@ -195,6 +198,33 @@
     parts[2] = quote(parts[2].encode('utf-8'), '/%')
     return urlunparse(parts).encode('utf-8')

+_url_regex = None
+
+def url_regex ():
+    global _url_regex
+    if _url_regex is not None:
+        return _url_regex
+
+    tldfile = locate_resource('ibid', 'data/tlds-alpha-by-domain.txt')
+    if tldfile:
+        f = file(tldfile, 'r')
+        tlds = [tld.strip().lower() for tld in f.readlines()
+                if not tld.startswith('#')]
+        f.close()
+    else:
+        log.warning(u"Couldn't open TLD list, falling back to minimal default")
+        tlds = 'com.org.net.za'.split('.')
+
+    _url_regex = (
+        r'(?:\w+://|(?:www|ftp)\.)\S+?' # Match an explicit URL or guess by www.
+        r'|[^@\s:/]+\.(?:%s)(?:/\S*?)?' # Guess at the URL based on TLD
+    ) % '|'.join(tlds)
+
+    return _url_regex
+
+def is_url(url):
+    return re.match('^' + url_regex() + '$', url, re.I)
+
 def json_webservice(url, params={}, headers={}):
     "Request data from a JSON webservice, and deserialise"

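
One detail of `is_url()` that may be worth checking: the pattern returned by `url_regex()` is a top-level alternation, so in `'^' + pattern + '$'` the `^` binds only to the first branch and the `$` only to the second. The sketch below uses a simplified stand-in pattern with a hard-coded TLD list (an assumption, not ibid's data file) and two hypothetical functions to compare the ungrouped form with one wrapped in a non-capturing group.

```python
import re

# Simplified stand-in for url_regex(): same two-branch shape, tiny TLD list.
URL_PATTERN = (
    r'(?:\w+://|(?:www|ftp)\.)\S+?'               # explicit URL or www./ftp. guess
    r'|[^@\s:/]+\.(?:com|org|net|za)(?:/\S*?)?'   # guess based on TLD
)


def is_url_ungrouped(text):
    # Parsed as (^branch1) | (branch2$): branch1 is not anchored at the end.
    return re.match('^' + URL_PATTERN + '$', text, re.I) is not None


def is_url_grouped(text):
    # The non-capturing group makes both anchors apply to both branches.
    return re.match('^(?:' + URL_PATTERN + ')$', text, re.I) is not None


for sample in ('http://example.com/page', 'example.org',
               'http://example.com and more', 'hello world'):
    print(sample, is_url_ungrouped(sample), is_url_grouped(sample))
```

In this sketch the two forms disagree only on `http://example.com and more`, which the ungrouped form accepts because the first branch matches a prefix of the text. Whether that matters for the Translate processor depends on how phrases starting with a URL should be handled.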
