Merge lp:~dholbach/developer-ubuntu-com/importer-post-deployment-fixes into lp:developer-ubuntu-com

Proposed by Daniel Holbach
Status: Merged
Approved by: David Callé
Approved revision: 212
Merged at revision: 194
Proposed branch: lp:~dholbach/developer-ubuntu-com/importer-post-deployment-fixes
Merge into: lp:developer-ubuntu-com
Diff against target: 480 lines (+189/-72)
7 files modified
md_importer/importer/__init__.py (+12/-0)
md_importer/importer/article.py (+37/-22)
md_importer/importer/publish.py (+54/-22)
md_importer/importer/repo.py (+0/-1)
md_importer/tests/test_branch_import.py (+40/-6)
md_importer/tests/test_link_rewrite.py (+43/-20)
md_importer/tests/utils.py (+3/-1)
To merge this branch: bzr merge lp:~dholbach/developer-ubuntu-com/importer-post-deployment-fixes
Reviewer: Ubuntu App Developer site developers (status: Pending)
Review via email: mp+284309@code.launchpad.net

Description of the change

This is ready to land now.

List of fixes:
 - use only a shortlist of Markdown extensions
 - fix the rewriting of links in articles (links between articles in the
   same branch); fix and extend the tests
 - simplify the code somewhat; remove unused bits
 - fix the stripping of tags like <body>; add tests
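
The <body>-stripping fix from the list above can be sketched as a standalone function. This is a sketch mirroring the regexes in the diff below; the branch implements it as the `Article._remove_body_and_html_tags()` method, not as a free function:

```python
import re


def remove_body_and_html_tags(html):
    # markdown.markdown wraps its output in <html><body>...</body></html>,
    # and BeautifulSoup.prettify adds <html><head></head><body>; strip all
    # of these wrappers so only the article content remains.
    for regex in [
        r'\s*<html>\s*<body>\s*',
        r'\s*</body>\s*</html>\s*',
        r'\s*<html>\s*<head>\s*</head>\s*<body>\s*',
    ]:
        html = re.sub(regex, '', html, flags=re.MULTILINE)
    return html


print(remove_body_and_html_tags('<html>\n <body>\n<p>Hi</p>\n</body>\n</html>'))
# -> <p>Hi</p>
```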

192. By Daniel Holbach

only set page to public_object if it exists

193. By Daniel Holbach

remove unnecessary reset of repo.pages

194. By Daniel Holbach

merge from trunk

195. By Daniel Holbach

- make TestLinkRewrite link check explicit
- fix condition in TestLinkBrokenRewrite to not leave for loop early

196. By Daniel Holbach

- bring TestSnapcraftLinkRewrite test closer to reality and just import what's important to us
- fix condition to not leave the for loop early

197. By Daniel Holbach

remove unnecessary line

198. By Daniel Holbach

avoid 'break' in the for loop

199. By Daniel Holbach

break out update_page functionality into separate function

200. By Daniel Holbach

store list of local images and if links were rewritten in the article object, use the new update_page function

201. By Daniel Holbach

add TODO item, make pyflakes and pep8 happy

202. By Daniel Holbach

remove body/html tags after soup.prettify

203. By Daniel Holbach

add test to ensure we strip all <body> tags from the imported articles

204. By Daniel Holbach

make sure internal links start with '/'
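
The check behind this commit can be sketched as follows (hypothetical helper name; the branch does this inline in `Article.replace_links()`):

```python
from urllib.parse import urlparse


def ensure_root_relative(href):
    # External links keep their scheme; anything local must be
    # root-relative so it resolves regardless of the current page.
    if urlparse(href).scheme in ('http', 'https') or href.startswith('/'):
        return href
    return '/' + href
```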

205. By Daniel Holbach

fix tests wrt fixed links

206. By Daniel Holbach

remove stray print

207. By Daniel Holbach

make regexes for stripping body/html/head tags clearer

208. By Daniel Holbach

drop pymdownx.headeranchor: it creates problems (the order of link attributes gets mixed up depending on which HTML output we use), and we don't need it on the page

209. By Daniel Holbach

- when comparing HTML, always use clean_html from djangocms_text_ckeditor
  and soup.prettify, so we're looking at the same output style
- add convenience function find_text_plugin
- check not only if the draft's html has changed, but also if the published
  version changed
- update test as well, check only published pages
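
The core idea here, never compare freshly generated HTML against stored plugin HTML without normalizing both sides first, can be sketched with a stdlib stand-in (the branch itself normalizes via `clean_html` from djangocms_text_ckeditor plus `soup.prettify`):

```python
import re


def _normalize(html):
    # Collapse whitespace differences so two renderings of the same
    # content compare equal (a stand-in for clean_html + prettify).
    collapsed = re.sub(r'\s+', ' ', html).strip()
    return collapsed.replace('> ', '>').replace(' <', '<')


def compare_html(html_a, html_b):
    return _normalize(html_a) == _normalize(html_b)
```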

210. By Daniel Holbach

make sure we don't have 'None' as slug for the root node, add tests (one for the URL, one for links in the HTML)
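
The URL side of this fix amounts to one substitution (a sketch; `DEFAULT_LANG = 'en'` is an assumption standing in for the site's LANGUAGE_CODE):

```python
import re

DEFAULT_LANG = 'en'  # assumption: stands in for the site's LANGUAGE_CODE


def fix_none_in_url(full_url):
    # A root page without a slug yields URLs like '/None/snappy/...';
    # rewrite the bogus segment to the default language prefix.
    return re.sub(r'^/None/', '/{}/'.format(DEFAULT_LANG), full_url)
```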

211. By Daniel Holbach

cater to the use case where we just import snappy docs but have no release_alias (i.e. current) set
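
A sketch of the guard (the regex is the one used in `SnappyArticle.read()`; the helper name is hypothetical):

```python
import re


def release_alias_from_url(full_url):
    # Docs imported outside 'snappy/guides/<alias>/...' carry no release
    # alias; indexing findall()[0] used to raise IndexError for them.
    matches = re.findall(r'snappy/guides/(\S+?)/\S+?', full_url)
    return matches[0] if matches else None
```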

212. By Daniel Holbach

fix typo

Preview Diff

=== modified file 'md_importer/importer/__init__.py'
--- md_importer/importer/__init__.py 2016-01-12 11:44:04 +0000
+++ md_importer/importer/__init__.py 2016-03-02 11:13:56 +0000
@@ -3,3 +3,15 @@
 DEFAULT_LANG = LANGUAGE_CODE
 HOME_PAGE_URL = '/{}/'.format(DEFAULT_LANG)
 SUPPORTED_ARTICLE_TYPES = ['.md', '.html']
+
+# Instead of just using pymdownx.github, we go with these because of
+# https://github.com/facelessuser/pymdown-extensions/issues/11
+MARKDOWN_EXTENSIONS = [
+    'markdown.extensions.tables',
+    'pymdownx.magiclink',
+    'pymdownx.betterem',
+    'pymdownx.tilde',
+    'pymdownx.githubemoji',
+    'pymdownx.tasklist',
+    'pymdownx.superfences',
+]

=== modified file 'md_importer/importer/article.py'
--- md_importer/importer/article.py 2016-01-15 13:56:34 +0000
+++ md_importer/importer/article.py 2016-03-02 11:13:56 +0000
@@ -8,9 +8,10 @@
 
 from . import (
     DEFAULT_LANG,
+    MARKDOWN_EXTENSIONS,
     SUPPORTED_ARTICLE_TYPES,
 )
-from .publish import get_or_create_page, slugify
+from .publish import get_or_create_page, slugify, update_page
 
 if sys.version_info.major == 2:
     from urlparse import urlparse
@@ -27,18 +28,18 @@
         self.write_to = slugify(self.fn)
         self.full_url = write_to
         self.slug = os.path.basename(self.full_url)
+        self.links_rewritten = False
+        self.local_images = []
 
     def _find_local_images(self):
         '''Local images are currently not supported.'''
         soup = BeautifulSoup(self.html, 'html5lib')
-        local_images = []
         for img in soup.find_all('img'):
             if img.has_attr('src'):
                 (scheme, netloc, path, params, query, fragment) = \
                     urlparse(img.attrs['src'])
                 if scheme not in ['http', 'https']:
-                    local_images.extend([img.attrs['src']])
-        return local_images
+                    self.local_images.extend([img.attrs['src']])
 
     def read(self):
         if os.path.splitext(self.fn)[1] not in SUPPORTED_ARTICLE_TYPES:
@@ -50,13 +51,13 @@
                 self.html = markdown.markdown(
                     f.read(),
                     output_format='html5',
-                    extensions=['pymdownx.github'])
+                    extensions=MARKDOWN_EXTENSIONS)
             elif self.fn.endswith('.html'):
                 self.html = f.read()
-        local_images = self._find_local_images()
-        if local_images:
+        self._find_local_images()
+        if self.local_images:
             logging.error('Found the following local image(s): {}'.format(
-                ', '.join(local_images)
+                ', '.join(self.local_images)
             ))
             return False
         self.title = self._read_title()
@@ -73,10 +74,15 @@
         return slugify(self.fn).replace('-', ' ').title()
 
     def _remove_body_and_html_tags(self):
-        self.html = re.sub(r"<html>\n\s<body>\n", "", self.html,
-                           flags=re.MULTILINE)
-        self.html = re.sub(r"\s<\/body>\n<\/html>", "", self.html,
-                           flags=re.MULTILINE)
+        for regex in [
+                # These are added by markdown.markdown
+                r'\s*<html>\s*<body>\s*',
+                r'\s*<\/body>\s*<\/html>\s*',
+                # This is added by BeautifulSoup.prettify
+                r'\s*<html>\s*<head>\s*<\/head>\s*<body>\s*',
+                ]:
+            self.html = re.sub(regex, '', self.html,
+                               flags=re.MULTILINE)
 
     def _use_developer_site_style(self):
         begin = (u"<div class=\"row no-border\">"
@@ -92,7 +98,6 @@
 
     def replace_links(self, titles, url_map):
         soup = BeautifulSoup(self.html, 'html5lib')
-        change = False
        for link in soup.find_all('a'):
            if not link.has_attr('class') or \
                    'headeranchor-link' not in link.attrs['class']:
@@ -100,10 +105,12 @@
                     if title.endswith(link.attrs['href']) and \
                             link.attrs['href'] != url_map[title].full_url:
                         link.attrs['href'] = url_map[title].full_url
-                        change = True
-        if change:
+                        if not link.attrs['href'].startswith('/'):
+                            link.attrs['href'] = '/' + link.attrs['href']
+                        self.links_rewritten = True
+        if self.links_rewritten:
             self.html = soup.prettify()
-        return change
+            self._remove_body_and_html_tags()
 
     def add_to_db(self):
         '''Publishes pages in their branch alias namespace.'''
@@ -112,13 +119,19 @@
             html=self.html)
         if not self.page:
             return False
-        self.full_url = self.page.get_absolute_url()
+        self.full_url = re.sub(
+            r'^\/None\/', '/{}/'.format(DEFAULT_LANG),
+            self.page.get_absolute_url())
         return True
 
     def publish(self):
+        if self.links_rewritten:
+            update_page(self.page, title=self.title, full_url=self.full_url,
+                        menu_title=self.title, html=self.html)
         if self.page.is_dirty(DEFAULT_LANG):
             self.page.publish(DEFAULT_LANG)
-        self.page = self.page.get_public_object()
+        if self.page.get_public_object():
+            self.page = self.page.get_public_object()
         return self.page
 
 
@@ -128,14 +141,16 @@
     def read(self):
         if not Article.read(self):
             return False
-        self.release_alias = re.findall(r'snappy/guides/(\S+?)/\S+?',
-                                        self.full_url)[0]
+        matches = re.findall(r'snappy/guides/(\S+?)/\S+?',
+                             self.full_url)
+        if matches:
+            self.release_alias = matches[0]
         self._make_snappy_mods()
         return True
 
     def _make_snappy_mods(self):
         # Make sure the reader knows which documentation she is browsing
-        if self.release_alias != 'current':
+        if self.release_alias and self.release_alias != 'current':
             before = (u"<div class=\"row no-border\">\n"
                       "<div class=\"eight-col\">\n")
             after = (u"<div class=\"row no-border\">\n"
@@ -158,6 +173,6 @@
                 redirect="/snappy/guides/current/{}".format(self.slug))
         if not page:
             return False
-        else:
+        elif self.release_alias:
             self.title += " (%s)" % (self.release_alias,)
         return Article.add_to_db(self)

=== modified file 'md_importer/importer/publish.py'
--- md_importer/importer/publish.py 2016-01-15 13:58:39 +0000
+++ md_importer/importer/publish.py 2016-03-02 11:13:56 +0000
@@ -4,11 +4,18 @@
 from cms.models import Title
 from djangocms_text_ckeditor.html import clean_html
 
+from bs4 import BeautifulSoup
 import logging
 import re
 import os
 
 
+def _compare_html(html_a, html_b):
+    soup_a = BeautifulSoup(html_a, 'html5lib')
+    soup_b = BeautifulSoup(html_b, 'html5lib')
+    return (clean_html(soup_a.prettify()) == clean_html(soup_b.prettify()))
+
+
 def slugify(filename):
     return os.path.basename(filename).replace('.md', '').replace('.html', '')
 
@@ -32,6 +39,51 @@
     return parent_pages[0].page
 
 
+def find_text_plugin(page):
+    # We create the page, so we know there's just one placeholder
+    placeholder = page.placeholders.all()[0]
+    if placeholder.get_plugins():
+        return (
+            placeholder,
+            placeholder.get_plugins()[0].get_plugin_instance()[0]
+        )
+    return (placeholder, None)
+
+
+def update_page(page, title, full_url, menu_title=None,
+                in_navigation=True, redirect=None, html=None):
+    if page.get_title() != title:
+        page.title = title
+    if page.get_menu_title() != menu_title:
+        page.menu_title = menu_title
+    if page.in_navigation != in_navigation:
+        page.in_navigation = in_navigation
+    if page.get_redirect() != redirect:
+        page.redirect = redirect
+    if html:
+        update = True
+        (placeholder, plugin) = find_text_plugin(page)
+        if plugin:
+            if _compare_html(html, plugin.body):
+                update = False
+            elif page.get_public_object():
+                (dummy, published_plugin) = \
+                    find_text_plugin(page.get_public_object())
+                if published_plugin:
+                    if _compare_html(html, published_plugin.body):
+                        update = False
+            if update:
+                plugin.body = html
+                plugin.save()
+            else:
+                # Reset draft
+                page.get_draft_object().revert(DEFAULT_LANG)
+        else:
+            add_plugin(
+                placeholder, 'RawHtmlPlugin',
+                DEFAULT_LANG, body=html)
+
+
 def get_or_create_page(title, full_url, menu_title=None,
                        in_navigation=True, redirect=None, html=None):
     # First check if pages already exist.
@@ -39,26 +91,8 @@
         path__regex=full_url).filter(publisher_is_draft=True)
     if pages:
         page = pages[0].page
-        if page.get_title() != title:
-            page.title = title
-        if page.get_menu_title() != menu_title:
-            page.menu_title = menu_title
-        if page.in_navigation != in_navigation:
-            page.in_navigation = in_navigation
-        if page.get_redirect() != redirect:
-            page.redirect = redirect
-        if html:
-            # We create the page, so we know there's just one placeholder
-            placeholder = page.placeholders.all()[0]
-            if placeholder.get_plugins():
-                plugin = placeholder.get_plugins()[0].get_plugin_instance()[0]
-                if plugin.body != clean_html(html, full=False):
-                    plugin.body = html
-                    plugin.save()
-            else:
-                add_plugin(
-                    placeholder, 'RawHtmlPlugin',
-                    DEFAULT_LANG, body=html)
+        update_page(page, title, full_url, menu_title, in_navigation,
+                    redirect, html)
     else:
         parent = _find_parent(full_url)
         if not parent:
@@ -70,6 +104,4 @@
             position='last-child', redirect=redirect)
         placeholder = page.placeholders.get()
         add_plugin(placeholder, 'RawHtmlPlugin', DEFAULT_LANG, body=html)
-        placeholder = page.placeholders.all()[0]
-        plugin = placeholder.get_plugins()[0].get_plugin_instance()[0]
     return page

=== modified file 'md_importer/importer/repo.py'
--- md_importer/importer/repo.py 2016-01-15 18:54:50 +0000
+++ md_importer/importer/repo.py 2016-03-02 11:13:56 +0000
@@ -118,7 +118,6 @@
                 logging.error('Publishing of {} aborted.'.format(self.origin))
                 return False
             article.replace_links(self.titles, self.url_map)
-        self.pages = []
         for article in self.imported_articles:
             self.pages.extend([article.publish()])
         if self.index_page:

=== modified file 'md_importer/tests/test_branch_import.py'
--- md_importer/tests/test_branch_import.py 2016-01-15 13:59:32 +0000
+++ md_importer/tests/test_branch_import.py 2016-03-02 11:13:56 +0000
@@ -2,9 +2,10 @@
 import pytz
 import shutil
 
-from cms.models import CMSPlugin, Page
+from cms.models import Page
 
 from md_importer.importer.article import Article
+from md_importer.importer.publish import find_text_plugin
 from .utils import TestLocalBranchImport
 
 
@@ -66,6 +67,39 @@
         self.assertEqual(page.parent_id, self.root.id)
 
 
+class TestArticleHTMLTagsAfterImport(TestLocalBranchImport):
+    def runTest(self):
+        self.create_repo('data/snapcraft-test')
+        self.repo.add_directive('docs', '')
+        self.assertEqual(len(self.repo.directives), 1)
+        self.assertTrue(self.repo.execute_import_directives())
+        self.assertGreater(len(self.repo.imported_articles), 3)
+        self.assertTrue(self.repo.publish())
+        pages = Page.objects.all()
+        self.assertGreater(pages.count(), len(self.repo.imported_articles))
+        for article in self.repo.imported_articles:
+            self.assertIsInstance(article, Article)
+            self.assertNotIn('<body>', article.html)
+            self.assertNotIn('&lt;body&gt;', article.html)
+
+
+class TestNoneInURLAfterImport(TestLocalBranchImport):
+    def runTest(self):
+        self.create_repo('data/snapcraft-test')
+        self.repo.add_directive('docs', '')
+        self.assertEqual(len(self.repo.directives), 1)
+        self.assertTrue(self.repo.execute_import_directives())
+        self.assertGreater(len(self.repo.imported_articles), 3)
+        self.assertTrue(self.repo.publish())
+        pages = Page.objects.all()
+        self.assertGreater(pages.count(), len(self.repo.imported_articles))
+        for article in self.repo.imported_articles:
+            self.assertIsInstance(article, Article)
+            self.assertNotIn('/None/', article.full_url)
+        for page in pages:
+            self.assertIsNotNone(page.get_slug())
+
+
 class TestTwiceImport(TestLocalBranchImport):
     '''Run import on the same contents twice, make sure we don't
     add new pages over and over again.'''
@@ -101,9 +135,9 @@
         self.assertEqual(
             Page.objects.filter(publisher_is_draft=False).count(),
             len(self.repo.imported_articles)+1)  # articles + root
+        shutil.rmtree(self.tempdir)
         # Take the time before publishing the second import
         now = datetime.now(pytz.utc)
-        shutil.rmtree(self.tempdir)
         # Run second import
         self.create_repo('data/snapcraft-test')
         self.repo.add_directive('docs', '')
@@ -112,7 +146,7 @@
         self.assertTrue(self.repo.execute_import_directives())
         self.assertTrue(self.repo.publish())
         # Check the page's plugins
-        for plugin_change in CMSPlugin.objects.filter(
-                plugin_type='RawHtmlPlugin').order_by(
-                '-changed_date'):
-            self.assertGreater(now, plugin_change.changed_date)
+        for page in Page.objects.filter(publisher_is_draft=False):
+            if page != self.root:
+                (dummy, plugin) = find_text_plugin(page)
+                self.assertGreater(now, plugin.changed_date)

=== modified file 'md_importer/tests/test_link_rewrite.py'
--- md_importer/tests/test_link_rewrite.py 2016-01-11 14:38:51 +0000
+++ md_importer/tests/test_link_rewrite.py 2016-03-02 11:13:56 +0000
@@ -30,6 +30,11 @@
                         link.attrs['href'],
                         ', '.join([p.get_absolute_url() for p in pages])))
                 self.assertIn(page, pages)
+            if article.slug == 'file1':
+                for link in soup.find_all('a'):
+                    if not link.has_attr('class') or \
+                            'headeranchor-link' not in link.attrs['class']:
+                        self.assertEqual(link.attrs['href'], '/file2')
 
 
 class TestLinkBrokenRewrite(TestLocalBranchImport):
@@ -45,12 +50,34 @@
             self.assertEqual(article.page.parent, self.root)
             soup = BeautifulSoup(article.html, 'html5lib')
             for link in soup.find_all('a'):
-                if link.has_attr('class') and \
-                        'headeranchor-link' in link.attrs['class']:
-                    break
-                page = self.check_local_link(link.attrs['href'])
-                self.assertIsNone(page)
-                self.assertNotIn(page, pages)
+                if not link.has_attr('class') or \
+                        'headeranchor-link' not in link.attrs['class']:
+                    page = self.check_local_link(link.attrs['href'])
+                    self.assertIsNone(page)
+                    self.assertNotIn(page, pages)
+
+
+class TestNoneNotInLinks(TestLocalBranchImport):
+    def runTest(self):
+        self.create_repo('data/snapcraft-test')
+        snappy_page = db_add_empty_page('Snappy', self.root)
+        self.assertFalse(snappy_page.publisher_is_draft)
+        build_apps = db_add_empty_page('Build Apps', snappy_page)
+        self.assertFalse(build_apps.publisher_is_draft)
+        self.assertEqual(
+            3, Page.objects.filter(publisher_is_draft=False).count())
+        self.repo.add_directive('docs/intro.md', 'snappy/build-apps/current')
+        self.repo.add_directive('docs', 'snappy/build-apps/current')
+        self.assertTrue(self.repo.execute_import_directives())
+        self.assertTrue(self.repo.publish())
+        pages = Page.objects.all()
+        for article in self.repo.imported_articles:
+            self.assertTrue(isinstance(article, Article))
+            self.assertGreater(len(article.html), 0)
+            soup = BeautifulSoup(article.html, 'html5lib')
+            for link in soup.find_all('a'):
+                if is_local_link(link):
+                    self.assertFalse(link.attrs['href'].startswith('/None/'))
 
 
 class TestSnapcraftLinkRewrite(TestLocalBranchImport):
@@ -62,25 +89,21 @@
         self.assertFalse(build_apps.publisher_is_draft)
         self.assertEqual(
             3, Page.objects.filter(publisher_is_draft=False).count())
-        self.repo.add_directive('docs', 'snappy/build-apps/devel')
-        self.repo.add_directive('README.md', 'snappy/build-apps/devel')
-        self.repo.add_directive(
-            'HACKING.md', 'snappy/build-apps/devel/hacking')
+        self.repo.add_directive('docs/intro.md', 'snappy/build-apps/current')
+        self.repo.add_directive('docs', 'snappy/build-apps/current')
         self.assertTrue(self.repo.execute_import_directives())
         self.assertTrue(self.repo.publish())
         pages = Page.objects.all()
         for article in self.repo.imported_articles:
             self.assertTrue(isinstance(article, Article))
             self.assertGreater(len(article.html), 0)
-        for article in self.repo.imported_articles:
             soup = BeautifulSoup(article.html, 'html5lib')
             for link in soup.find_all('a'):
-                if not is_local_link(link):
-                    break
-                page = self.check_local_link(link.attrs['href'])
-                self.assertIsNotNone(
-                    page,
-                    msg='Link {} not found. Available pages: {}'.format(
-                        link.attrs['href'],
-                        ', '.join([p.get_absolute_url() for p in pages])))
-                self.assertIn(page, pages)
+                if is_local_link(link):
+                    page = self.check_local_link(link.attrs['href'])
+                    self.assertIsNotNone(
+                        page,
+                        msg='Link {} not found. Available pages: {}'.format(
+                            link.attrs['href'],
+                            ', '.join([p.get_absolute_url() for p in pages])))
+                    self.assertIn(page, pages)

=== modified file 'md_importer/tests/utils.py'
--- md_importer/tests/utils.py 2016-01-11 14:38:51 +0000
+++ md_importer/tests/utils.py 2016-03-02 11:13:56 +0000
@@ -55,8 +55,10 @@
         self.assertEqual(self.fetch_retcode, 0)
 
     def check_local_link(self, url):
+        if not url.startswith('/'):
+            url = '/' + url
         if not url.startswith('/{}/'.format(DEFAULT_LANG)):
-            url = '/{}/{}/'.format(DEFAULT_LANG, url)
+            url = '/{}'.format(DEFAULT_LANG) + url
         request = self.get_request(url)
         page = get_page_from_request(request)
         return page
