Merge lp:~dholbach/developer-ubuntu-com/importer-post-deployment-fixes into lp:developer-ubuntu-com

Proposed by Daniel Holbach
Status: Merged
Approved by: David Callé
Approved revision: 212
Merged at revision: 194
Proposed branch: lp:~dholbach/developer-ubuntu-com/importer-post-deployment-fixes
Merge into: lp:developer-ubuntu-com
Diff against target: 480 lines (+189/-72)
7 files modified
md_importer/importer/__init__.py (+12/-0)
md_importer/importer/article.py (+37/-22)
md_importer/importer/publish.py (+54/-22)
md_importer/importer/repo.py (+0/-1)
md_importer/tests/test_branch_import.py (+40/-6)
md_importer/tests/test_link_rewrite.py (+43/-20)
md_importer/tests/utils.py (+3/-1)
To merge this branch: bzr merge lp:~dholbach/developer-ubuntu-com/importer-post-deployment-fixes
Reviewer: Ubuntu App Developer site developers (status: Pending)
Review via email: mp+284309@code.launchpad.net

Description of the change

This is ready to land now.

List of fixes:
 - use only a shortlist of Markdown extensions
 - fix the rewriting of links in articles (links between articles in the
   same branch); fix and extend the tests
 - simplify the code somewhat; remove unused bits
 - fix the stripping of tags like <body>; add tests
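
The <body>-stripping fix from the list above can be sketched as a standalone function. This is a sketch mirroring the regexes in the diff below; the branch implements it as the `Article._remove_body_and_html_tags()` method, not as a free function:

```python
import re


def remove_body_and_html_tags(html):
    # markdown.markdown wraps its output in <html><body>...</body></html>,
    # and BeautifulSoup.prettify adds <html><head></head><body>; strip all
    # of these wrappers so only the article content remains.
    for regex in [
        r'\s*<html>\s*<body>\s*',
        r'\s*</body>\s*</html>\s*',
        r'\s*<html>\s*<head>\s*</head>\s*<body>\s*',
    ]:
        html = re.sub(regex, '', html, flags=re.MULTILINE)
    return html


print(remove_body_and_html_tags('<html>\n <body>\n<p>Hi</p>\n</body>\n</html>'))
# -> <p>Hi</p>
```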

192. By Daniel Holbach

only set page to public_object if it exists

193. By Daniel Holbach

remove unnecessary reset of repo.pages

194. By Daniel Holbach

merge from trunk

195. By Daniel Holbach

- make TestLinkRewrite link check explicit
- fix condition in TestLinkBrokenRewrite to not leave for loop early

196. By Daniel Holbach

- bring TestSnapcraftLinkRewrite test closer to reality and just import what's important to us
- fix condition to not leave the for loop early

197. By Daniel Holbach

remove unnecessary line

198. By Daniel Holbach

avoid 'break' in the for loop

199. By Daniel Holbach

break out update_page functionality into separate function

200. By Daniel Holbach

store list of local images and if links were rewritten in the article object, use the new update_page function

201. By Daniel Holbach

add TODO item, make pyflakes and pep8 happy

202. By Daniel Holbach

remove body/html tags after soup.prettify

203. By Daniel Holbach

add test to ensure we strip all <body> tags from the imported articles

204. By Daniel Holbach

make sure internal links start with '/'
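
The check behind this commit can be sketched as follows (hypothetical helper name; the branch does this inline in `Article.replace_links()`):

```python
from urllib.parse import urlparse


def ensure_root_relative(href):
    # External links keep their scheme; anything local must be
    # root-relative so it resolves regardless of the current page.
    if urlparse(href).scheme in ('http', 'https') or href.startswith('/'):
        return href
    return '/' + href
```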

205. By Daniel Holbach

fix tests wrt fixed links

206. By Daniel Holbach

remove stray print

207. By Daniel Holbach

make regexes for stripping body/html/head tags clearer

208. By Daniel Holbach

drop pymdownx.headeranchor: it creates problems (the order of link attributes gets mixed up depending on which HTML output we use), and we don't need it on the page

209. By Daniel Holbach

- when comparing HTML, always use clean_html from djangocms_text_ckeditor
  and soup.prettify, so we're looking at the same output style
- add convenience function find_text_plugin
- check not only if the draft's html has changed, but also if the published
  version changed
- update test as well, check only published pages
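
The core idea here, never compare freshly generated HTML against stored plugin HTML without normalizing both sides first, can be sketched with a stdlib stand-in (the branch itself normalizes via `clean_html` from djangocms_text_ckeditor plus `soup.prettify`):

```python
import re


def _normalize(html):
    # Collapse whitespace differences so two renderings of the same
    # content compare equal (a stand-in for clean_html + prettify).
    collapsed = re.sub(r'\s+', ' ', html).strip()
    return collapsed.replace('> ', '>').replace(' <', '<')


def compare_html(html_a, html_b):
    return _normalize(html_a) == _normalize(html_b)
```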

210. By Daniel Holbach

make sure we don't have 'None' as slug for the root node, add tests (one for the URL, one for links in the HTML)
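
The URL side of this fix amounts to one substitution (a sketch; `DEFAULT_LANG = 'en'` is an assumption standing in for the site's LANGUAGE_CODE):

```python
import re

DEFAULT_LANG = 'en'  # assumption: stands in for the site's LANGUAGE_CODE


def fix_none_in_url(full_url):
    # A root page without a slug yields URLs like '/None/snappy/...';
    # rewrite the bogus segment to the default language prefix.
    return re.sub(r'^/None/', '/{}/'.format(DEFAULT_LANG), full_url)
```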

211. By Daniel Holbach

cater to the use case where we just import snappy docs but have no release_alias (i.e. current) set
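
A sketch of the guard (the regex is the one used in `SnappyArticle.read()`; the helper name is hypothetical):

```python
import re


def release_alias_from_url(full_url):
    # Docs imported outside 'snappy/guides/<alias>/...' carry no release
    # alias; indexing findall()[0] used to raise IndexError for them.
    matches = re.findall(r'snappy/guides/(\S+?)/\S+?', full_url)
    return matches[0] if matches else None
```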

212. By Daniel Holbach

fix typo

Preview Diff

=== modified file 'md_importer/importer/__init__.py'
--- md_importer/importer/__init__.py 2016-01-12 11:44:04 +0000
+++ md_importer/importer/__init__.py 2016-03-02 11:13:56 +0000
@@ -3,3 +3,15 @@
 DEFAULT_LANG = LANGUAGE_CODE
 HOME_PAGE_URL = '/{}/'.format(DEFAULT_LANG)
 SUPPORTED_ARTICLE_TYPES = ['.md', '.html']
+
+# Instead of just using pymdownx.github, we go with these because of
+# https://github.com/facelessuser/pymdown-extensions/issues/11
+MARKDOWN_EXTENSIONS = [
+    'markdown.extensions.tables',
+    'pymdownx.magiclink',
+    'pymdownx.betterem',
+    'pymdownx.tilde',
+    'pymdownx.githubemoji',
+    'pymdownx.tasklist',
+    'pymdownx.superfences',
+]

=== modified file 'md_importer/importer/article.py'
--- md_importer/importer/article.py 2016-01-15 13:56:34 +0000
+++ md_importer/importer/article.py 2016-03-02 11:13:56 +0000
@@ -8,9 +8,10 @@
 
 from . import (
     DEFAULT_LANG,
+    MARKDOWN_EXTENSIONS,
     SUPPORTED_ARTICLE_TYPES,
 )
-from .publish import get_or_create_page, slugify
+from .publish import get_or_create_page, slugify, update_page
 
 if sys.version_info.major == 2:
     from urlparse import urlparse
@@ -27,18 +28,18 @@
         self.write_to = slugify(self.fn)
         self.full_url = write_to
         self.slug = os.path.basename(self.full_url)
+        self.links_rewritten = False
+        self.local_images = []
 
     def _find_local_images(self):
         '''Local images are currently not supported.'''
         soup = BeautifulSoup(self.html, 'html5lib')
-        local_images = []
         for img in soup.find_all('img'):
             if img.has_attr('src'):
                 (scheme, netloc, path, params, query, fragment) = \
                     urlparse(img.attrs['src'])
                 if scheme not in ['http', 'https']:
-                    local_images.extend([img.attrs['src']])
-        return local_images
+                    self.local_images.extend([img.attrs['src']])
 
     def read(self):
         if os.path.splitext(self.fn)[1] not in SUPPORTED_ARTICLE_TYPES:
@@ -50,13 +51,13 @@
                 self.html = markdown.markdown(
                     f.read(),
                     output_format='html5',
-                    extensions=['pymdownx.github'])
+                    extensions=MARKDOWN_EXTENSIONS)
             elif self.fn.endswith('.html'):
                 self.html = f.read()
-        local_images = self._find_local_images()
-        if local_images:
+        self._find_local_images()
+        if self.local_images:
             logging.error('Found the following local image(s): {}'.format(
-                ', '.join(local_images)
+                ', '.join(self.local_images)
             ))
             return False
         self.title = self._read_title()
@@ -73,10 +74,15 @@
         return slugify(self.fn).replace('-', ' ').title()
 
     def _remove_body_and_html_tags(self):
-        self.html = re.sub(r"<html>\n\s<body>\n", "", self.html,
-                           flags=re.MULTILINE)
-        self.html = re.sub(r"\s<\/body>\n<\/html>", "", self.html,
-                           flags=re.MULTILINE)
+        for regex in [
+                # These are added by markdown.markdown
+                r'\s*<html>\s*<body>\s*',
+                r'\s*<\/body>\s*<\/html>\s*',
+                # This is added by BeautifulSoup.prettify
+                r'\s*<html>\s*<head>\s*<\/head>\s*<body>\s*',
+                ]:
+            self.html = re.sub(regex, '', self.html,
+                               flags=re.MULTILINE)
 
     def _use_developer_site_style(self):
         begin = (u"<div class=\"row no-border\">"
@@ -92,7 +98,6 @@
 
     def replace_links(self, titles, url_map):
         soup = BeautifulSoup(self.html, 'html5lib')
-        change = False
        for link in soup.find_all('a'):
            if not link.has_attr('class') or \
                    'headeranchor-link' not in link.attrs['class']:
@@ -100,10 +105,12 @@
                     if title.endswith(link.attrs['href']) and \
                             link.attrs['href'] != url_map[title].full_url:
                         link.attrs['href'] = url_map[title].full_url
-                        change = True
-        if change:
+                        if not link.attrs['href'].startswith('/'):
+                            link.attrs['href'] = '/' + link.attrs['href']
+                        self.links_rewritten = True
+        if self.links_rewritten:
             self.html = soup.prettify()
-        return change
+            self._remove_body_and_html_tags()
 
     def add_to_db(self):
         '''Publishes pages in their branch alias namespace.'''
@@ -112,13 +119,19 @@
             html=self.html)
         if not self.page:
             return False
-        self.full_url = self.page.get_absolute_url()
+        self.full_url = re.sub(
+            r'^\/None\/', '/{}/'.format(DEFAULT_LANG),
+            self.page.get_absolute_url())
         return True
 
     def publish(self):
+        if self.links_rewritten:
+            update_page(self.page, title=self.title, full_url=self.full_url,
+                        menu_title=self.title, html=self.html)
         if self.page.is_dirty(DEFAULT_LANG):
             self.page.publish(DEFAULT_LANG)
-        self.page = self.page.get_public_object()
+        if self.page.get_public_object():
+            self.page = self.page.get_public_object()
         return self.page
 
 
@@ -128,14 +141,16 @@
     def read(self):
         if not Article.read(self):
             return False
-        self.release_alias = re.findall(r'snappy/guides/(\S+?)/\S+?',
-                                        self.full_url)[0]
+        matches = re.findall(r'snappy/guides/(\S+?)/\S+?',
+                             self.full_url)
+        if matches:
+            self.release_alias = matches[0]
         self._make_snappy_mods()
         return True
 
     def _make_snappy_mods(self):
         # Make sure the reader knows which documentation she is browsing
-        if self.release_alias != 'current':
+        if self.release_alias and self.release_alias != 'current':
             before = (u"<div class=\"row no-border\">\n"
                       "<div class=\"eight-col\">\n")
             after = (u"<div class=\"row no-border\">\n"
@@ -158,6 +173,6 @@
                 redirect="/snappy/guides/current/{}".format(self.slug))
         if not page:
             return False
-        else:
+        elif self.release_alias:
             self.title += " (%s)" % (self.release_alias,)
         return Article.add_to_db(self)

=== modified file 'md_importer/importer/publish.py'
--- md_importer/importer/publish.py 2016-01-15 13:58:39 +0000
+++ md_importer/importer/publish.py 2016-03-02 11:13:56 +0000
@@ -4,11 +4,18 @@
 from cms.models import Title
 from djangocms_text_ckeditor.html import clean_html
 
+from bs4 import BeautifulSoup
 import logging
 import re
 import os
 
 
+def _compare_html(html_a, html_b):
+    soup_a = BeautifulSoup(html_a, 'html5lib')
+    soup_b = BeautifulSoup(html_b, 'html5lib')
+    return (clean_html(soup_a.prettify()) == clean_html(soup_b.prettify()))
+
+
 def slugify(filename):
     return os.path.basename(filename).replace('.md', '').replace('.html', '')
 
@@ -32,6 +39,51 @@
     return parent_pages[0].page
 
 
+def find_text_plugin(page):
+    # We create the page, so we know there's just one placeholder
+    placeholder = page.placeholders.all()[0]
+    if placeholder.get_plugins():
+        return (
+            placeholder,
+            placeholder.get_plugins()[0].get_plugin_instance()[0]
+        )
+    return (placeholder, None)
+
+
+def update_page(page, title, full_url, menu_title=None,
+                in_navigation=True, redirect=None, html=None):
+    if page.get_title() != title:
+        page.title = title
+    if page.get_menu_title() != menu_title:
+        page.menu_title = menu_title
+    if page.in_navigation != in_navigation:
+        page.in_navigation = in_navigation
+    if page.get_redirect() != redirect:
+        page.redirect = redirect
+    if html:
+        update = True
+        (placeholder, plugin) = find_text_plugin(page)
+        if plugin:
+            if _compare_html(html, plugin.body):
+                update = False
+            elif page.get_public_object():
+                (dummy, published_plugin) = \
+                    find_text_plugin(page.get_public_object())
+                if published_plugin:
+                    if _compare_html(html, published_plugin.body):
+                        update = False
+            if update:
+                plugin.body = html
+                plugin.save()
+            else:
+                # Reset draft
+                page.get_draft_object().revert(DEFAULT_LANG)
+        else:
+            add_plugin(
+                placeholder, 'RawHtmlPlugin',
+                DEFAULT_LANG, body=html)
+
+
 def get_or_create_page(title, full_url, menu_title=None,
                        in_navigation=True, redirect=None, html=None):
     # First check if pages already exist.
@@ -39,26 +91,8 @@
         path__regex=full_url).filter(publisher_is_draft=True)
     if pages:
         page = pages[0].page
-        if page.get_title() != title:
-            page.title = title
-        if page.get_menu_title() != menu_title:
-            page.menu_title = menu_title
-        if page.in_navigation != in_navigation:
-            page.in_navigation = in_navigation
-        if page.get_redirect() != redirect:
-            page.redirect = redirect
-        if html:
-            # We create the page, so we know there's just one placeholder
-            placeholder = page.placeholders.all()[0]
-            if placeholder.get_plugins():
-                plugin = placeholder.get_plugins()[0].get_plugin_instance()[0]
-                if plugin.body != clean_html(html, full=False):
-                    plugin.body = html
-                    plugin.save()
-            else:
-                add_plugin(
-                    placeholder, 'RawHtmlPlugin',
-                    DEFAULT_LANG, body=html)
+        update_page(page, title, full_url, menu_title, in_navigation,
+                    redirect, html)
     else:
         parent = _find_parent(full_url)
         if not parent:
@@ -70,6 +104,4 @@
             position='last-child', redirect=redirect)
         placeholder = page.placeholders.get()
         add_plugin(placeholder, 'RawHtmlPlugin', DEFAULT_LANG, body=html)
-        placeholder = page.placeholders.all()[0]
-        plugin = placeholder.get_plugins()[0].get_plugin_instance()[0]
     return page

=== modified file 'md_importer/importer/repo.py'
--- md_importer/importer/repo.py 2016-01-15 18:54:50 +0000
+++ md_importer/importer/repo.py 2016-03-02 11:13:56 +0000
@@ -118,7 +118,6 @@
                 logging.error('Publishing of {} aborted.'.format(self.origin))
                 return False
             article.replace_links(self.titles, self.url_map)
-        self.pages = []
         for article in self.imported_articles:
             self.pages.extend([article.publish()])
         if self.index_page:

=== modified file 'md_importer/tests/test_branch_import.py'
--- md_importer/tests/test_branch_import.py 2016-01-15 13:59:32 +0000
+++ md_importer/tests/test_branch_import.py 2016-03-02 11:13:56 +0000
@@ -2,9 +2,10 @@
 import pytz
 import shutil
 
-from cms.models import CMSPlugin, Page
+from cms.models import Page
 
 from md_importer.importer.article import Article
+from md_importer.importer.publish import find_text_plugin
 from .utils import TestLocalBranchImport
 
 
@@ -66,6 +67,39 @@
         self.assertEqual(page.parent_id, self.root.id)
 
 
+class TestArticleHTMLTagsAfterImport(TestLocalBranchImport):
+    def runTest(self):
+        self.create_repo('data/snapcraft-test')
+        self.repo.add_directive('docs', '')
+        self.assertEqual(len(self.repo.directives), 1)
+        self.assertTrue(self.repo.execute_import_directives())
+        self.assertGreater(len(self.repo.imported_articles), 3)
+        self.assertTrue(self.repo.publish())
+        pages = Page.objects.all()
+        self.assertGreater(pages.count(), len(self.repo.imported_articles))
+        for article in self.repo.imported_articles:
+            self.assertIsInstance(article, Article)
+            self.assertNotIn('<body>', article.html)
+            self.assertNotIn('&lt;body&gt;', article.html)
+
+
+class TestNoneInURLAfterImport(TestLocalBranchImport):
+    def runTest(self):
+        self.create_repo('data/snapcraft-test')
+        self.repo.add_directive('docs', '')
+        self.assertEqual(len(self.repo.directives), 1)
+        self.assertTrue(self.repo.execute_import_directives())
+        self.assertGreater(len(self.repo.imported_articles), 3)
+        self.assertTrue(self.repo.publish())
+        pages = Page.objects.all()
+        self.assertGreater(pages.count(), len(self.repo.imported_articles))
+        for article in self.repo.imported_articles:
+            self.assertIsInstance(article, Article)
+            self.assertNotIn('/None/', article.full_url)
+        for page in pages:
+            self.assertIsNotNone(page.get_slug())
+
+
 class TestTwiceImport(TestLocalBranchImport):
     '''Run import on the same contents twice, make sure we don't
     add new pages over and over again.'''
@@ -101,9 +135,9 @@
         self.assertEqual(
             Page.objects.filter(publisher_is_draft=False).count(),
             len(self.repo.imported_articles)+1)  # articles + root
+        shutil.rmtree(self.tempdir)
         # Take the time before publishing the second import
         now = datetime.now(pytz.utc)
-        shutil.rmtree(self.tempdir)
         # Run second import
         self.create_repo('data/snapcraft-test')
         self.repo.add_directive('docs', '')
@@ -112,7 +146,7 @@
         self.assertTrue(self.repo.execute_import_directives())
         self.assertTrue(self.repo.publish())
         # Check the page's plugins
-        for plugin_change in CMSPlugin.objects.filter(
-                plugin_type='RawHtmlPlugin').order_by(
-                '-changed_date'):
-            self.assertGreater(now, plugin_change.changed_date)
+        for page in Page.objects.filter(publisher_is_draft=False):
+            if page != self.root:
+                (dummy, plugin) = find_text_plugin(page)
+                self.assertGreater(now, plugin.changed_date)

=== modified file 'md_importer/tests/test_link_rewrite.py'
--- md_importer/tests/test_link_rewrite.py 2016-01-11 14:38:51 +0000
+++ md_importer/tests/test_link_rewrite.py 2016-03-02 11:13:56 +0000
@@ -30,6 +30,11 @@
                         link.attrs['href'],
                         ', '.join([p.get_absolute_url() for p in pages])))
                 self.assertIn(page, pages)
+            if article.slug == 'file1':
+                for link in soup.find_all('a'):
+                    if not link.has_attr('class') or \
+                            'headeranchor-link' not in link.attrs['class']:
+                        self.assertEqual(link.attrs['href'], '/file2')
 
 
 class TestLinkBrokenRewrite(TestLocalBranchImport):
@@ -45,12 +50,34 @@
             self.assertEqual(article.page.parent, self.root)
             soup = BeautifulSoup(article.html, 'html5lib')
             for link in soup.find_all('a'):
-                if link.has_attr('class') and \
-                        'headeranchor-link' in link.attrs['class']:
-                    break
-                page = self.check_local_link(link.attrs['href'])
-                self.assertIsNone(page)
-                self.assertNotIn(page, pages)
+                if not link.has_attr('class') or \
+                        'headeranchor-link' not in link.attrs['class']:
+                    page = self.check_local_link(link.attrs['href'])
+                    self.assertIsNone(page)
+                    self.assertNotIn(page, pages)
+
+
+class TestNoneNotInLinks(TestLocalBranchImport):
+    def runTest(self):
+        self.create_repo('data/snapcraft-test')
+        snappy_page = db_add_empty_page('Snappy', self.root)
+        self.assertFalse(snappy_page.publisher_is_draft)
+        build_apps = db_add_empty_page('Build Apps', snappy_page)
+        self.assertFalse(build_apps.publisher_is_draft)
+        self.assertEqual(
+            3, Page.objects.filter(publisher_is_draft=False).count())
+        self.repo.add_directive('docs/intro.md', 'snappy/build-apps/current')
+        self.repo.add_directive('docs', 'snappy/build-apps/current')
+        self.assertTrue(self.repo.execute_import_directives())
+        self.assertTrue(self.repo.publish())
+        pages = Page.objects.all()
+        for article in self.repo.imported_articles:
+            self.assertTrue(isinstance(article, Article))
+            self.assertGreater(len(article.html), 0)
+            soup = BeautifulSoup(article.html, 'html5lib')
+            for link in soup.find_all('a'):
+                if is_local_link(link):
+                    self.assertFalse(link.attrs['href'].startswith('/None/'))
 
 
 class TestSnapcraftLinkRewrite(TestLocalBranchImport):
@@ -62,25 +89,21 @@
         self.assertFalse(build_apps.publisher_is_draft)
         self.assertEqual(
             3, Page.objects.filter(publisher_is_draft=False).count())
-        self.repo.add_directive('docs', 'snappy/build-apps/devel')
-        self.repo.add_directive('README.md', 'snappy/build-apps/devel')
-        self.repo.add_directive(
-            'HACKING.md', 'snappy/build-apps/devel/hacking')
+        self.repo.add_directive('docs/intro.md', 'snappy/build-apps/current')
+        self.repo.add_directive('docs', 'snappy/build-apps/current')
         self.assertTrue(self.repo.execute_import_directives())
         self.assertTrue(self.repo.publish())
         pages = Page.objects.all()
         for article in self.repo.imported_articles:
             self.assertTrue(isinstance(article, Article))
             self.assertGreater(len(article.html), 0)
-        for article in self.repo.imported_articles:
             soup = BeautifulSoup(article.html, 'html5lib')
             for link in soup.find_all('a'):
-                if not is_local_link(link):
-                    break
-                page = self.check_local_link(link.attrs['href'])
-                self.assertIsNotNone(
-                    page,
-                    msg='Link {} not found. Available pages: {}'.format(
-                        link.attrs['href'],
-                        ', '.join([p.get_absolute_url() for p in pages])))
-                self.assertIn(page, pages)
+                if is_local_link(link):
+                    page = self.check_local_link(link.attrs['href'])
+                    self.assertIsNotNone(
+                        page,
+                        msg='Link {} not found. Available pages: {}'.format(
+                            link.attrs['href'],
+                            ', '.join([p.get_absolute_url() for p in pages])))
+                    self.assertIn(page, pages)

=== modified file 'md_importer/tests/utils.py'
--- md_importer/tests/utils.py 2016-01-11 14:38:51 +0000
+++ md_importer/tests/utils.py 2016-03-02 11:13:56 +0000
@@ -55,8 +55,10 @@
         self.assertEqual(self.fetch_retcode, 0)
 
     def check_local_link(self, url):
+        if not url.startswith('/'):
+            url = '/' + url
         if not url.startswith('/{}/'.format(DEFAULT_LANG)):
-            url = '/{}/{}/'.format(DEFAULT_LANG, url)
+            url = '/{}'.format(DEFAULT_LANG) + url
         request = self.get_request(url)
         page = get_page_from_request(request)
         return page
