Merge lp:~widelands-dev/widelands-website/update_beautifulsoup4 into lp:widelands-website

Proposed by kaputtnik
Status: Merged
Merged at revision: 507
Proposed branch: lp:~widelands-dev/widelands-website/update_beautifulsoup4
Merge into: lp:widelands-website
Diff against target: 434 lines (+153/-113)
5 files modified
mainpage/templatetags/wl_markdown.py (+114/-91)
mainpage/views.py (+1/-1)
pip_requirements.txt (+2/-1)
pybb/util.py (+35/-14)
settings.py (+1/-6)
To merge this branch: bzr merge lp:~widelands-dev/widelands-website/update_beautifulsoup4
Reviewer Review Type Date Requested Status
Widelands Developers Pending
Review via email: mp+358571@code.launchpad.net

Commit message

Update BeautifulSoup and make needed changes

Description of the change

Update BeautifulSoup3 to BeautifulSoup4. This is a prerequisite for the Python update.

In contrast to bs3, bs4 escapes all (Python) strings, so it is no longer possible to insert raw HTML as a plain (unicode) string like <a href="http://example.com">LINKTEXT</a>. Inserted into a bs4 object, this comes out as &lt;a href="http://example.com"&gt;LINKTEXT&lt;/a&gt;.
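
For illustration, a minimal sketch of the difference (the example code is mine, not part of the diff):

    from bs4 import BeautifulSoup

    soup = BeautifulSoup('<p>Hello</p>', features='lxml')

    # Appending raw HTML as a string: bs4 escapes it on output.
    soup.p.append('<a href="http://example.com">LINKTEXT</a>')
    print(soup.p)
    # -> <p>Hello&lt;a href="http://example.com"&gt;LINKTEXT&lt;/a&gt;</p>

    # Building a real Tag object instead keeps the markup intact.
    soup2 = BeautifulSoup('<p>Hello</p>', features='lxml')
    link = soup2.new_tag('a', href='http://example.com')
    link.string = 'LINKTEXT'
    soup2.p.append(link)
    print(soup2.p)
    # -> <p>Hello<a href="http://example.com">LINKTEXT</a></p>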

This branch takes care of this and modifies the affected code to work with BeautifulSoup4 objects. The new code could probably be smarter, but I find it understandable.

I have also renamed some variables and reworked some comments.

The rendering times are on par with BeautifulSoup3, or slightly better. E.g.

For the Developers page: /developers/
bs3: ~0.62s
bs4: ~0.45s

For /wiki/WikiSyntax/
bs3: ~0.14s
bs4: ~0.14s

The regular expression for finding pasted plain-text links is tested here: https://regexr.com/42pq5
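
To sketch how it is used (a minimal example, not the branch code verbatim): because the pattern is one capturing group, re.split() keeps the matched links in the result list, so links and plain text can be handled separately:

    import re

    PLAIN_LINK_RE = re.compile(r'(http[s]?:\/\/[-a-zA-Z0-9@:%._\+~#=/?]+)')

    parts = PLAIN_LINK_RE.split('see https://widelands.org for details')
    # -> ['see ', 'https://widelands.org', ' for details']
    # Parts starting with 'http' become <a> tags, everything else stays a plain string.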

I have also removed the SMILEY_PREESCAPING workaround, because everything works without it now. The only remaining problem: the 'devilish' smiley won't work if it is placed as the very first characters, because Markdown then treats the leading '>' as a blockquote marker. I am in favor of replacing '>:-)' with ']:-)' to fix this. Any remarks on this?
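
A minimal illustration of the underlying blockquote problem (example mine, not from the branch):

    from markdown import markdown

    print(markdown('>:-) hello'))
    # -> <blockquote><p>:-) hello</p></blockquote> (roughly; the '>' is consumed)
    print(markdown(']:-) hello'))
    # -> <p>]:-) hello</p>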

523. By kaputtnik

removed misspelled doc-string

Revision history for this message
GunChleoc (gunchleoc) wrote :

2 comments.

Since smilies aren't being used much, I'd be OK with changing the code for the devil.

524. By kaputtnik

string fix; use of consistent string identifier

525. By kaputtnik

change devilish smiley

Revision history for this message
kaputtnik (franku) wrote :

Regarding ftp: A good point :-)

I did a query on the database for posts containing 'ftp:' and it gave no results, so such links haven't been used so far. AFAIK ftp is used to point directly to downloadable content. I'm not sure if we should allow such links at all, but they can always be provided with the Markdown syntax
[linkname](ftp://some_file.txt)

Thanks for the review :-)

526. By kaputtnik

merged trunk

527. By kaputtnik

removed silly list comprehension

528. By kaputtnik

merged trunk

Revision history for this message
kaputtnik (franku) wrote :

This is merged and deployed now.

Preview Diff

1=== modified file 'mainpage/templatetags/wl_markdown.py'
2--- mainpage/templatetags/wl_markdown.py 2017-11-14 16:54:28 +0000
3+++ mainpage/templatetags/wl_markdown.py 2018-11-18 10:39:28 +0000
4@@ -25,7 +25,7 @@
5 import urllib
6 import bleach
7
8-from BeautifulSoup import BeautifulSoup, NavigableString
9+from bs4 import BeautifulSoup, NavigableString
10
11 # If we can import a Wiki module with Articles, we
12 # will check for internal wikipages links in all internal
13@@ -38,8 +38,7 @@
14
15 # We will also need the site domain
16 from django.contrib.sites.models import Site
17-from settings import SITE_ID, SMILEYS, SMILEY_DIR, \
18- SMILEY_PREESCAPING
19+from settings import SITE_ID, SMILEYS, SMILEY_DIR
20
21 try:
22 _domain = Site.objects.get(pk=SITE_ID).domain
23@@ -61,41 +60,56 @@
24 """This searches for smiley symbols in the current text and replaces them
25 with the correct images.
26
27- Only replacing if smiley symbols aren't in a word (e.g. http://....)
28-
29+ Contents get split into words and afterwards the whole content must be
30+ reassembled.
31 """
32- words = text.split(' ')
33- for sc, img in SMILEYS:
34- if sc in words:
35- words[words.index(
36- sc)] = "<img src='%s%s' alt='%s' />" % (SMILEY_DIR, img, img)
37- text = ' '.join(words)
38- return text
39-
40-
41-def _insert_smiley_preescaping(text):
42- """This searches for smiley symbols in the current text and replaces them
43- with the correct images."""
44- for before, after in SMILEY_PREESCAPING:
45- text = text.replace(before, after)
46- return text
47+
48+ tmp_content = []
49+ for content in text.parent.contents:
50+ try:
51+ # If this fails, content is not a plain string, e.g. a <br /> tag
52+ words = content.split(' ')
53+ except AttributeError:
54+ # append the unsplittable content unchanged and continue
55+ tmp_content.append(content)
56+ continue
57+
58+ for i, word in enumerate(words):
59+ smiley = ''
60+ for sc, img in SMILEYS:
61+ if word == sc:
62+ smiley = img
63+ if smiley:
64+ img_tag = BeautifulSoup(features='lxml').new_tag('img')
65+ img_tag['src'] = '{}{}'.format(SMILEY_DIR, smiley)
66+ img_tag['alt'] = smiley
67+ tmp_content.append(img_tag)
68+ # append a space after the smiley
69+ tmp_content.append(NavigableString(' '))
70+ else:
71+ if i < (len(words) - 1):
72+ # Append a space after each word, except the last one
73+ word = word + ' '
74+ tmp_content.append(NavigableString(word))
75+
76+ text.parent.contents = tmp_content
77
78
79 def _classify_link(tag):
80- """Returns a classname to insert if this link is in any way special
81+ """Applies a classname if this link is in any way special
82 (external or missing wikipages)
83
84- tag to classify for
85+ tag: classify for this tag
86
87 """
88 # No class change for image links
89- if tag.findChild('img') != None:
90- return None
91+ if tag.next_element.name == 'img':
92+ return
93
94 try:
95 href = tag['href'].lower()
96 except KeyError:
97- return None
98+ return
99
100 # Check for external link
101 if href.startswith('http'):
102@@ -105,56 +119,86 @@
103 external = False
104 break
105 if external:
106- return {'class': 'externalLink', 'title': 'This link refers to outer space'}
107+ tag['class'] = 'externalLink'
108+ tag['title'] = 'This link refers to outer space'
109+ return
110
111 if '/profile/' in (tag['href']):
112- return {'class': 'userLink', 'title': 'This link refers to a userpage'}
113+ tag['class'] = 'userLink'
114+ tag['title'] = 'This link refers to a userpage'
115+ return
116
117 if check_for_missing_wikipages and href.startswith('/wiki/'):
118
119 # Check for missing wikilink /wiki/PageName[/additionl/stuff]
120 # Using href because we need cAsEs here
121- pn = urllib.unquote(tag['href'][6:].split('/', 1)[0])
122+ article_name = urllib.unquote(tag['href'][6:].split('/', 1)[0])
123
124- if not len(pn): # Wiki root link is not a page
125- return {'class': 'wrongLink', 'title': 'This Link misses an articlename'}
126+ if not len(article_name): # Wiki root link is not a page
127+ tag['class'] = 'wrongLink'
128+ tag['title'] = 'This Link misses an articlename'
129+ return
130
131 # Wiki special pages are also not counted
132- if pn in ['list', 'search', 'history', 'feeds', 'observe', 'edit']:
133- return {'class': 'specialLink'}
134+ if article_name in ['list', 'search', 'history', 'feeds', 'observe', 'edit']:
135+ tag['class'] = 'specialLink'
136+ return
137
138 # Check for a redirect
139 try:
140 # try to get the article id; if this fails an IndexError is raised
141 a_id = ChangeSet.objects.filter(
142- old_title=pn).values_list('article_id')[0]
143+ old_title=article_name).values_list('article_id')[0]
144
145 # get actual title of article
146 act_t = Article.objects.get(id=a_id[0]).title
147- if pn != act_t:
148- return {'title': "This is a redirect and points to \"" + act_t + "\""}
149+ if article_name != act_t:
150+ tag['title'] = 'This is a redirect and points to "' + act_t + '"'
151+ return
152 else:
153- return None
154+ return
155 except IndexError:
156 pass
157
158 # article missing (or misspelled)
159- if Article.objects.filter(title=pn).count() == 0:
160- return {'class': 'missingLink', 'title': 'This Link is misspelled or missing. Click to create it anyway.'}
161-
162- return None
163-
164-
165-def _clickable_image(tag):
166+ if Article.objects.filter(title=article_name).count() == 0:
167+ tag['class'] = 'missingLink'
168+ tag['title'] = 'This Link is misspelled or missing. Click to create it anyway.'
169+ return
170+ return
171+
172+
173+def _make_clickable_images(tag):
174 # is external link?
175 if tag['src'].startswith('http'):
176- # is allways a link?
177+ # Do not change if it is already a link
178 if tag.parent.name != 'a':
179 # add link to image
180- text = '<a href=' + tag['src'] + \
181- '><img src=' + tag['src'] + '></a>'
182- return text
183- return None
184+ new_link = BeautifulSoup(features='lxml').new_tag('a')
185+ new_link['href'] = tag['src']
186+ new_img = BeautifulSoup(features='lxml').new_tag('img')
187+ new_img['src'] = tag['src']
188+ new_img['alt'] = tag['alt']
189+ new_link.append(new_img)
190+ tag.replace_with(new_link)
191+ return
192+
193+
194+def find_smiley_Strings(bs4_string):
195+ """Find strings that contain a smiley symbol.
196+
197+ Don't find a smiley in code tags.
198+ Attention: This also returns True for ':/' in 'http://'. This gets
199+ fixed in _insert_smileys().
200+ """
201+
202+ if bs4_string.parent.name.lower() == 'code':
203+ return False
204+
205+ for sc in SMILEYS:
206+ if sc[0] in bs4_string:
207+ return True
208+ return False
209
210
211 # Predefine the markdown extensions here to have a clean code in
212@@ -162,10 +206,9 @@
213 md_extensions = ['extra', 'toc', SemanticWikiLinkExtension()]
214
215 def do_wl_markdown(value, *args, **keyw):
216- # Do Preescaping for markdown, so that some things stay intact
217- # This is currently only needed for this smiley ">:-)"
218- value = _insert_smiley_preescaping(value)
219- custom = keyw.pop('custom', True)
220+ """Apply wl specific things, like smileys or colored links."""
221+
222+ beautify = keyw.pop('beautify', True)
223 html = smart_str(markdown(value, extensions=md_extensions))
224
225 # Sanitize posts from potencial untrusted users (Forum/Wiki/Maps)
226@@ -173,49 +216,29 @@
227 html = mark_safe(bleach.clean(
228 html, tags=BLEACH_ALLOWED_TAGS, attributes=BLEACH_ALLOWED_ATTRIBUTES))
229
230- # Since we only want to do replacements outside of tags (in general) and not between
231- # <a> and </a> we partition our site accordingly
232- # BeautifoulSoup does all the heavy lifting
233- soup = BeautifulSoup(html)
234+ # Prepare the html and apply smileys and classes.
235+ # BeautifulSoup objects are all references, so changing a variable
236+ # derived from the soup will affect the soup itself.
237+ # Because of that, the called functions modify the soup in place.
238+ soup = BeautifulSoup(html, features='lxml')
239 if len(soup.contents) == 0:
240 # well, empty soup. Return it
241 return unicode(soup)
242
243- for text in soup.findAll(text=True):
244- # Do not replace inside a link
245- if text.parent.name == 'a':
246- continue
247-
248- # We do our own small preprocessing of the stuff we got, after markdown
249- # went over it General consensus is to avoid replacing anything in
250- # links [blah](blkf)
251- if custom:
252- rv = text
253- # Replace smileys; only outside "code-tags"
254- if not text.parent.name == 'code':
255- rv = _insert_smileys(rv)
256-
257- text.replaceWith(rv)
258-
259- # This call slows the whole function down...
260- # unicode->reparsing.
261- # The function goes from .5 ms to 1.5ms on my system
262- # Well, for our site with it's little traffic it's maybe not so important...
263- # What a waste of cycles :(
264- soup = BeautifulSoup(unicode(soup))
265- # We have to go over this to classify links
266- for tag in soup.findAll('a'):
267- rv = _classify_link(tag)
268- if rv:
269- for attribute in rv.iterkeys():
270- tag[attribute] = rv.get(attribute)
271-
272- # All external images gets clickable
273- # This applies only in forum
274- for tag in soup.findAll('img'):
275- link = _clickable_image(tag)
276- if link:
277- tag.replaceWith(link)
278+ if beautify:
279+ # Insert smileys
280+ smiley_text = soup.find_all(string=find_smiley_Strings)
281+ for text in smiley_text:
282+ _insert_smileys(text)
283+
284+ # Classify links
285+ for tag in soup.find_all('a'):
286+ _classify_link(tag)
287+
288+ # Make all external images clickable.
289+ # This applies only in the forum.
290+ for tag in soup.find_all('img'):
291+ _make_clickable_images(tag)
292
293 return unicode(soup)
294
295
296=== modified file 'mainpage/views.py'
297--- mainpage/views.py 2018-04-03 05:18:03 +0000
298+++ mainpage/views.py 2018-11-18 10:39:28 +0000
299@@ -126,7 +126,7 @@
300 except IOError:
301 txt = txt + "Couldn't find developer file!"
302
303- txt = do_wl_markdown(txt, custom=False)
304+ txt = do_wl_markdown(txt, beautify=False)
305
306 return render(request, 'mainpage/developers.html',
307 {'developers': txt}
308
309=== modified file 'pip_requirements.txt'
310--- pip_requirements.txt 2018-10-03 11:03:43 +0000
311+++ pip_requirements.txt 2018-11-18 10:39:28 +0000
312@@ -1,6 +1,6 @@
313 # Python requirements for widelands-website at 22.06.2017
314
315-BeautifulSoup==3.2.0
316+beautifulsoup4==4.6.3
317 Django==1.11.12
318 django-haystack==2.8.1
319 # django-messages is very old on pypi
320@@ -11,6 +11,7 @@
321 django-registration==2.4.1
322 django-tagging==0.4.5
323 gunicorn==19.7.1
324+lxml==4.2.5
325 Markdown==2.6.8
326 mysqlclient==1.3.10
327 numpy==1.13.0
328
329=== modified file 'pybb/util.py'
330--- pybb/util.py 2018-10-01 16:41:29 +0000
331+++ pybb/util.py 2018-11-18 10:39:28 +0000
332@@ -2,8 +2,9 @@
333 import random
334 import traceback
335 import json
336+import re
337
338-from BeautifulSoup import BeautifulSoup
339+from bs4 import BeautifulSoup, NavigableString
340 from datetime import datetime
341 from django.shortcuts import render
342 from django.http import HttpResponse
343@@ -11,7 +12,6 @@
344 from django.utils.translation import check_for_language
345 from django.utils.encoding import force_unicode
346 from django import forms
347-from django.template.defaultfilters import urlize as django_urlize
348 from django.core.paginator import Paginator, EmptyPage, InvalidPage
349 from django.conf import settings
350 from pybb import settings as pybb_settings
351@@ -145,6 +145,16 @@
352 return form
353
354
355+PLAIN_LINK_RE = re.compile(r'(http[s]?:\/\/[-a-zA-Z0-9@:%._\+~#=/?]+)')
356+def exclude_code_tag(bs4_string):
357+ if bs4_string.parent.name == 'code':
358+ return False
359+ m = PLAIN_LINK_RE.search(bs4_string)
360+ if m:
361+ return True
362+ return False
363+
364+
365 def urlize(data):
366 """Urlize plain text links in the HTML contents.
367
368@@ -152,18 +162,29 @@
369
370 """
371
372- soup = BeautifulSoup(data)
373- for chunk in soup.findAll(text=True):
374- islink = False
375- ptr = chunk.parent
376- while ptr.parent:
377- if ptr.name == 'a' or ptr.name == 'code':
378- islink = True
379- break
380- ptr = ptr.parent
381- if not islink:
382- # Using unescape to prevent conversation of f.e. &gt; to &amp;gt;
383- chunk = chunk.replaceWith(django_urlize(unicode(unescape(chunk))))
384+ soup = BeautifulSoup(data, 'lxml')
385+ for found_string in soup.find_all(string=exclude_code_tag):
386+ new_content = []
387+ strings_or_tags = found_string.parent.contents
388+ for string_or_tag in strings_or_tags:
389+ try:
390+ for string in PLAIN_LINK_RE.split(string_or_tag):
391+ if string.startswith('http'):
392+ # Build an <a> tag for the link
393+ tag = soup.new_tag('a')
394+ tag['href'] = string
395+ tag.string = string
396+ tag['rel'] = 'nofollow'
397+ new_content.append(tag)
398+ else:
399+ # This is just text, append it as a bs4 string
400+ new_content.append(NavigableString(string))
401+ except TypeError:
402+ # Not a plain string (e.g. a tag), so append it unchanged
403+ new_content.append(string_or_tag)
404+
405+ # Replace the old contents with the new content
406+ found_string.parent.contents = new_content
407
408 return unicode(soup)
409
410
411=== modified file 'settings.py'
412--- settings.py 2018-10-05 19:10:18 +0000
413+++ settings.py 2018-11-18 10:39:28 +0000
414@@ -213,8 +213,7 @@
415 (':))', 'face-smile-big.png'),
416 (':-)', 'face-smile.png'),
417 (':)', 'face-smile.png'),
418- # Hack around markdown replacement. see also SMILEY_PREESCAPING
419- ('&gt;:-)', 'face-devilish.png'),
420+ (']:-)', 'face-devilish.png'),
421 ('8-)', 'face-glasses.png'),
422 ('8)', 'face-glasses.png'),
423 (':-D', 'face-grin.png'),
424@@ -243,10 +242,6 @@
425 (';-)', 'face-wink.png'),
426 (';)', 'face-wink.png'),
427 ]
428-# This needs to be done to keep some stuff hidden from markdown
429-SMILEY_PREESCAPING = [
430- ('>:-)', '\>:-)')
431-]
432
433 #################
434 # Search Config #
