Merge lp:~dosage-dev/dosage/bunch-of-comics into lp:~dosage-dev/dosage/old

Proposed by Tristan Seligmann
Status: Merged
Approved by: Jonathan Jacobs
Approved revision: not available
Merged at revision: not available
Proposed branch: lp:~dosage-dev/dosage/bunch-of-comics
Merge into: lp:~dosage-dev/dosage/old
Diff against target: 377 lines (+172/-10)
11 files modified
.bzrignore (+1/-0)
dosage/plugins/a.py (+19/-3)
dosage/plugins/b.py (+18/-0)
dosage/plugins/c.py (+12/-1)
dosage/plugins/g.py (+11/-1)
dosage/plugins/h.py (+21/-0)
dosage/plugins/keenspot.py (+1/-0)
dosage/plugins/l.py (+18/-0)
dosage/plugins/w.py (+1/-0)
dosage/test/test_util.py (+29/-2)
dosage/util.py (+41/-3)
To merge this branch: bzr merge lp:~dosage-dev/dosage/bunch-of-comics
Reviewer Review Type Date Requested Status
Jonathan Jacobs Approve
Review via email: mp+16758@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Jonathan Jacobs (jjacobs) wrote :

  1. There are a number of coding style infractions:
    * Only 2 lines between top-level suites.
    * Lines (not regular expressions) longer than 80 columns.
  2. The regular expression for HateSong.prevSearch can probably be simplified to use exact lengths.
  3. Bellen.imageSearch has an odd regular expression, spaces seem like the kind of thing that would appear in a "src" attribute.
  4. Effbot has a reasonable HTML entity decoder implementation[1] that could be the start of a better normalizeUrl implementation.

[1] http://effbot.org/zone/re-sub.htm#unescape-html

review: Needs Fixing
616. By Tristan Seligmann

Different regex for Bellen.

617. By Tristan Seligmann

More comprehensive quoting thingy.

618. By Tristan Seligmann

Fix some coding style issues.

619. By Tristan Seligmann

Tweak regex.

Revision history for this message
Tristan Seligmann (mithrandi) wrote :

Okay, think I've fixed all of those.

Revision history for this message
Jonathan Jacobs (jjacobs) :
review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== added file '.bzrignore'
2--- .bzrignore 1970-01-01 00:00:00 +0000
3+++ .bzrignore 2010-01-04 20:15:24 +0000
4@@ -0,0 +1,1 @@
5+dropin.cache
6
7=== modified file 'dosage/plugins/a.py'
8--- dosage/plugins/a.py 2009-12-15 06:55:27 +0000
9+++ dosage/plugins/a.py 2010-01-04 20:15:24 +0000
10@@ -1,6 +1,7 @@
11 from re import compile, MULTILINE
12
13-from dosage.helpers import BasicScraper, regexNamer, bounceStarter
14+from dosage.helpers import (
15+ BasicScraper, regexNamer, bounceStarter, indirectStarter)
16
17
18 class ALessonIsLearned(BasicScraper):
19@@ -67,6 +68,18 @@
20 help = 'Index format: nnn'
21
22
23+
24+class AnarchySD(BasicScraper):
25+ imageUrl = 'http://www.anarchycomic.com/page%s.php'
26+ imageSearch = compile(r'<img.+src="../(images/page\d+\..+?)"')
27+ prevSearch = compile(r'<a href="(page\d+\.php)">PREVIOUS PAGE')
28+ help = 'Index format: n (unpadded)'
29+ starter = indirectStarter(
30+ 'http://www.anarchycomic.com/page1.php',
31+ compile(r'<a href="(page\d+\.php)" class="style15">LATEST'))
32+
33+
34+
35 class Altermeta(BasicScraper):
36 latestUrl = 'http://www.altermeta.com/'
37 imageUrl = 'http://www.altermeta.com/index.php?PS=viewComic.php&comic=%s'
38@@ -125,11 +138,14 @@
39
40
41 class AstronomyPOTD(BasicScraper):
42- starter = bounceStarter('http://antwrp.gsfc.nasa.gov/apod/astropix.html', compile(r'<a href="(ap\d{6}\.html)">&gt;</a>'))
43+ starter = bounceStarter(
44+ 'http://antwrp.gsfc.nasa.gov/apod/astropix.html',
45+ compile(r'<a href="(ap\d{6}\.html)">&gt;</a>'))
46 imageUrl = 'http://antwrp.gsfc.nasa.gov/apod/ap%s.html'
47 imageSearch = compile(r'<a href="(image/\d{4}/.+\..+?)">')
48 prevSearch = compile(r'<a href="(ap\d{6}\.html)">&lt;</a>')
49 help = 'Index format: yymmdd'
50
51 def namer(cls, imageUrl, pageUrl):
52- return '%s-%s' % (pageUrl.split('/')[-1].split('.')[0][2:], imageUrl.split('/')[-1].split('.')[0])
53+ return '%s-%s' % (pageUrl.split('/')[-1].split('.')[0][2:],
54+ imageUrl.split('/')[-1].split('.')[0])
55
56=== modified file 'dosage/plugins/b.py'
57--- dosage/plugins/b.py 2009-12-15 06:55:27 +0000
58+++ dosage/plugins/b.py 2010-01-04 20:15:24 +0000
59@@ -169,3 +169,21 @@
60 starslipCrisis = blankLabel('StarslipCrisis', 'http://www.starslipcrisis.com/')
61 uglyHill = blankLabel('UglyHill', 'http://www.uglyhill.com/')
62 wapsiSquare = blankLabel('WapsiSquare', 'http://www.wapsisquare.com/')
63+
64+
65+
66+class BeePower(BasicScraper):
67+ latestUrl = 'http://comicswithoutviolence.com/d/20080713.html'
68+ imageUrl = 'http://comicswithoutviolence.com/d/%s.html'
69+ imageSearch = compile(r'src="(/comics/.+?)"')
70+ prevSearch = compile(r'(\d+\.html)"><img[^>]+?src="/images/previous_day.png"')
71+ help = 'Index format: yyyy/mm/dd'
72+
73+
74+
75+class Bellen(BasicScraper):
76+ latestUrl = 'http://boxbrown.com/'
77+ imageUrl = 'http://boxbrown.com/?p=%s'
78+ imageSearch = compile(r'<img src="(http://boxbrown.com/comics/[^"]+)"')
79+ prevSearch = compile(r'<a href="(.+?)"><span class="prev">')
80+ help = 'Index format: nnn'
81
82=== modified file 'dosage/plugins/c.py'
83--- dosage/plugins/c.py 2009-12-15 06:55:27 +0000
84+++ dosage/plugins/c.py 2010-01-04 20:15:24 +0000
85@@ -1,6 +1,7 @@
86 from re import compile
87
88-from dosage.helpers import BasicScraper, constStarter, bounceStarter, indirectStarter
89+from dosage.helpers import (
90+    BasicScraper, constStarter, bounceStarter, indirectStarter)
91 from dosage.util import getQueryParams
92
93
94@@ -303,9 +304,19 @@
95 zhi = creators('ZackHill', 'zhi')
96
97
98+
99 class CyanideAndHappiness(BasicScraper):
100 latestUrl = 'http://www.explosm.net/comics'
101 imageUrl = 'http://www.explosm.net/comics/%s'
102 imageSearch = compile(r'<img alt="Cyanide and Happiness, a daily webcomic" src="(http:\/\/www\.explosm\.net/db/files/Comics/\w+/\S+\.\w+)"')
103 prevSearch = compile(r'<a href="(/comics/\d+/?)">< Previous</a>')
104 help = 'Index format: n (unpadded)'
105+
106+
107+
108+class CrimsonDark(BasicScraper):
109+ latestUrl = 'http://www.davidcsimon.com/crimsondark/'
110+ imageUrl = 'http://www.davidcsimon.com/crimsondark/index.php?view=comic&strip_id=%s'
111+ imageSearch = compile(r'src="(.+?strips/.+?)"')
112+ prevSearch = compile(r'<a href=[\'"](/crimsondark/index\.php\?view=comic&amp;strip_id=\d+)[\'"]><img src=[\'"]themes/cdtheme/images/active_prev.png[\'"]')
113+ help = 'Index format: n (unpadded)'
114
115=== modified file 'dosage/plugins/g.py'
116--- dosage/plugins/g.py 2009-12-15 06:55:27 +0000
117+++ dosage/plugins/g.py 2010-01-04 20:15:24 +0000
118@@ -83,9 +83,19 @@
119 help = 'Index format: n'
120
121
122-class GunnerkrigCourt(BasicScraper):
123+
124+class GunnerkrigCourt(BasicScraper):
125 latestUrl = 'http://www.gunnerkrigg.com/index2.php'
126 imageUrl = 'http://www.gunnerkrigg.com/archive_page.php\?comicID=%s'
127 imageSearch = compile(r'<img src="(.+?//comics/.+?)"')
128 prevSearch = compile(r'<.+?(/archive_page.php\?comicID=.+?)".+?prev')
129 help = 'Index format: n'
130+
131+
132+
133+class Gunshow(BasicScraper):
134+ latestUrl = 'http://gunshowcomic.com/'
135+ imageUrl = 'http://gunshowcomic.com/d/%s.html'
136+ imageSearch = compile(r'src="(/comics/.+?)"')
137+ prevSearch = compile(r'(/d/\d+\.html)"><img[^>]+?src="/images/previous_day')
138+ help = 'Index format: yyyy/mm/dd'
139
140=== modified file 'dosage/plugins/h.py'
141--- dosage/plugins/h.py 2009-12-15 06:55:27 +0000
142+++ dosage/plugins/h.py 2010-01-04 20:15:24 +0000
143@@ -3,6 +3,7 @@
144 from dosage.helpers import BasicScraper
145
146
147+
148 class HappyMedium(BasicScraper):
149 latestUrl = 'http://happymedium.fast-bee.com/'
150 imageUrl = 'http://happymedium.fast-bee.com/%s'
151@@ -11,6 +12,7 @@
152 help = 'Index format: yyyy/mm/chapter-n-page-n'
153
154
155+
156 class Heliothaumic(BasicScraper):
157 latestUrl = 'http://thaumic.net/'
158 imageUrl = 'http://thaumic.net/%s'
159@@ -19,9 +21,28 @@
160 help = 'Index format: yyyy/mm/dd/n(unpadded)-comicname'
161
162
163+
164 class Housd(BasicScraper):
165 latestUrl = 'http://www.housd.net/'
166 imageUrl = 'http://housd.net/archive_page.php?comicID=%s'
167 imageSearch = compile(r'"(.+?/comics/.+?)"')
168 prevSearch = compile(r'"(h.+?comicID=.+?)".+?prev')
169 help = 'Index format: nnnn'
170+
171+
172+
173+class HateSong(BasicScraper):
174+ latestUrl = 'http://hatesong.com/'
175+ imageUrl = 'http://hatesong.com/%s/'
176+ imageSearch = compile(r'src="(http://www.hatesong.com/strips/.+?)"')
177+ prevSearch = compile(r'<div class="headernav"><a href="(http://hatesong.com/\d{4}/\d{2}/\d{2})')
178+ help = 'Index format: yyyy/mm/dd'
179+
180+
181+
182+class HorribleVille(BasicScraper):
183+ latestUrl = 'http://horribleville.com/d/20090517.html'
184+ imageUrl = 'http://horribleville.com/d/%s.html'
185+ imageSearch = compile(r'src="(/comics/.+?)"')
186+ prevSearch = compile(r'(\d+\.html)"><img[^>]+?src="/images/previous_day.png"')
187+ help = 'Index format: yyyy/mm/dd'
188
189=== modified file 'dosage/plugins/keenspot.py'
190--- dosage/plugins/keenspot.py 2009-12-15 06:55:27 +0000
191+++ dosage/plugins/keenspot.py 2010-01-04 20:15:24 +0000
192@@ -456,6 +456,7 @@
193 'FanserviceMeteorologyWin': 'http://aod.comicgenesis.com/',
194 'FantasticalBestiary': 'http://fantasticalbestiary.comicgenesis.com/',
195 'FantasyQwest': 'http://creatorauthorman.comicgenesis.com/',
196+ 'FaultyLogic': 'http://faultylogic.comicgenesis.com/',
197 'Feathers': 'http://feathers.comicgenesis.com/',
198 'FelixAndTheKidneyEater': 'http://fnk.comicgenesis.com/',
199 'Fellonist': 'http://thefellonist.comicgenesis.com/',
200
201=== modified file 'dosage/plugins/l.py'
202--- dosage/plugins/l.py 2009-12-15 06:55:27 +0000
203+++ dosage/plugins/l.py 2010-01-04 20:15:24 +0000
204@@ -98,3 +98,21 @@
205 # prevSearch=compile(r'<a href="(index.php\?comicid=\d+)"><img src="/images/gprev.gif"', IGNORECASE),
206 # help='Index format: n (unpadded)',
207 # namer=queryNamer('comicid'))
208+
209+
210+
211+class LegoRobot(BasicScraper):
212+ latestUrl = 'http://www.legorobotcomics.com/'
213+ imageUrl = 'http://www.legorobotcomics.com/?id=%s'
214+ imageSearch = compile(r'id="the_comic" src="(comics/.+?)"')
215+ prevSearch = compile(r'(\?id=\d+)"><img src="images/back.png"')
216+ help = 'Index format: nnnn'
217+
218+
219+
220+class LeastICouldDo(BasicScraper):
221+ latestUrl = 'http://www.leasticoulddo.com/'
222+ imageUrl = 'http://www.leasticoulddo.com/comic/%s'
223+ imageSearch = compile(r'<img src="(http://cdn.leasticoulddo.com/comics/\d{8}.\w{1,4})" />')
224+ prevSearch = compile(r'<a href="(/comic/\d{8})">Previous</a>')
225+ help = 'Index format: yyyymmdd'
226
227=== modified file 'dosage/plugins/w.py'
228--- dosage/plugins/w.py 2009-12-15 06:55:27 +0000
229+++ dosage/plugins/w.py 2010-01-04 20:15:24 +0000
230@@ -104,6 +104,7 @@
231 'NekkoAndJoruba': 'nekkoandjoruba/nekkoandjoruba/',
232 'JaxEpoch': 'johngreen/quicken/',
233 'QuantumRockOfAges': 'DreamchildNYC/quantum/',
234+ 'ClownSamurai' : 'qsamurai/clownsamurai/',
235 }
236
237 return dict((name, WebcomicsNation.make('WebcomicsNation/' + name, latestUrl='http://www.webcomicsnation.com/' + subpath)) for name, subpath in comics.iteritems())
238
239=== modified file 'dosage/test/test_util.py'
240--- dosage/test/test_util.py 2009-12-15 06:55:27 +0000
241+++ dosage/test/test_util.py 2010-01-04 20:15:24 +0000
242@@ -1,6 +1,8 @@
243 from twisted.trial.unittest import TestCase
244
245-from dosage.util import saneDataSize
246+from dosage.util import saneDataSize, normaliseURL, _unescape
247+
248+
249
250 class SizeFormattingTests(TestCase):
251 """
252@@ -15,6 +17,7 @@
253 self.assertEqual(saneDataSize(size), expectedOutput)
254 self.assertEqual(saneDataSize(-size), '-' + expectedOutput)
255
256+
257 def test_verySmallSize(self):
258 """
259 Sizes smaller than a single byte should be formatted as bytes; this
260@@ -22,12 +25,13 @@
261 """
262 self.check(0.1, '0.100 B')
263
264+
265 def test_normalSizes(self):
266 """
267 Sizes should be formatted in the largest unit for which the size will
268 not be less than a single unit.
269 """
270- self.check(1, '1.000 B')
271+ self.check(1, '1.000 B')
272 self.check(2.075 * 2 ** 10, '2.075 kB')
273 self.check(5.88 * 2 ** 20, '5.880 MB')
274 self.check(13.34 * 2 ** 30, '13.340 GB')
275@@ -37,8 +41,31 @@
276 self.check(57.892 * 2 ** 70, '57.892 ZB')
277 self.check(999.99 * 2 ** 80, '999.990 YB')
278
279+
280 def test_veryLargeSize(self):
281 """
282 Sizes larger than 1024 yottabytes should be formatted as yottabytes.
283 """
284 self.check(5567254 * 2 ** 80, '5567254.000 YB')
285+
286+
287+
288+class URLTests(TestCase):
289+ """
290+ Tests for URL utility functions.
291+ """
292+ def test_unescape(self):
293+ """
294+ Test HTML replacement.
295+ """
296+ self.assertEqual(_unescape('foo&amp;bar'), 'foo&bar')
297+ self.assertEqual(_unescape('foo&#160;bar'), 'foo%C2%A0bar')
298+ self.assertEqual(_unescape('&quot;foo&quot;'), '%22foo%22')
299+
300+
301+ def test_normalisation(self):
302+ """
303+ Test URL normalisation.
304+ """
305+ self.assertEqual(normaliseURL('http://foo.com//bar/baz&amp;baz'),
306+ 'http://foo.com/bar/baz&baz')
307
308=== modified file 'dosage/util.py'
309--- dosage/util.py 2009-12-15 06:55:27 +0000
310+++ dosage/util.py 2010-01-04 20:15:24 +0000
311@@ -6,6 +6,8 @@
312 import array
313 import os.path
314 import cgi
315+import re
316+from htmlentitydefs import name2codepoint
317 from time import sleep
318 from math import log, floor
319 from re import compile, IGNORECASE
320@@ -73,8 +75,43 @@
321
322 return xformedGroups
323
324-def normalizeUrl(url):
325- '''Removes any leading empty segments to avoid breaking urllib2.'''
326+
327+def _unescape(text):
328+ """
329+ Replace HTML entities and character references.
330+ """
331+ def _fixup(m):
332+ text = m.group(0)
333+ if text[:2] == "&#":
334+ # character reference
335+ try:
336+ if text[:3] == "&#x":
337+ text = unichr(int(text[3:-1], 16))
338+ else:
339+ text = unichr(int(text[2:-1]))
340+ except ValueError:
341+ pass
342+ else:
343+ # named entity
344+ try:
345+ text = unichr(name2codepoint[text[1:-1]])
346+ except KeyError:
347+ pass
348+ if isinstance(text, unicode):
349+ text = text.encode('utf-8')
350+ text = urllib2.quote(text, safe=';/?:@&=+$,')
351+ return text
352+ return re.sub("&#?\w+;", _fixup, text)
353+
354+
355+def normaliseURL(url):
356+ """
357+ Removes any leading empty segments to avoid breaking urllib2; also replaces
358+ HTML entities and character references.
359+ """
360+ # XXX: brutal hack
361+ url = _unescape(url)
362+
363 pu = list(urlparse.urlparse(url))
364 segments = pu[2].replace(' ', '%20').split('/')
365 while segments and segments[0] == '':
366@@ -82,9 +119,10 @@
367 pu[2] = '/' + '/'.join(segments)
368 return urlparse.urlunparse(pu)
369
370+
371 def urlopen(url, referrer=None, retries=5):
372 # Work around urllib2 brokenness
373- url = normalizeUrl(url)
374+ url = normaliseURL(url)
375 req = urllib2.Request(url)
376 if referrer:
377 req.add_header('Referrer', referrer)

Subscribers

People subscribed via source and target branches

to all changes: