Merge lp:~jjed/archive-crawler/foreign-icons into lp:~mvo/archive-crawler/mvo

Proposed by Jjed
Status: Merged
Merged at revision: 116
Proposed branch: lp:~jjed/archive-crawler/foreign-icons
Merge into: lp:~mvo/archive-crawler/mvo
Diff against target: 231 lines (+102/-10)
4 files modified
ArchiveCrawler/__init__.py (+8/-4)
DesktopDataExtractor/__init__.py (+83/-4)
data/icon_search.cfg (+11/-0)
getMenuData.py (+0/-2)
To merge this branch: bzr merge lp:~jjed/archive-crawler/foreign-icons
Reviewer       Review Type   Date Requested   Status
Michael Vogt                                  Approve
Review via email: mp+57166@code.launchpad.net

Description of the change

This branch adds support for fetching icons for a number of different
storage patterns. For example:

    audacity -> audacity-data
    emacs23 -> emacs23-common
    gnome-terminal -> gnome-icon-theme
    bluedevil-audio -> oxygen-icon-theme
    icedtea-netx-javaws -> openjdk-6 [direct dependency]

It introduces a non-trivial performance overhead due to the number of
extra debs it has to search, and changes the DesktopDataExtractor's
workflow to this:

    *Nearly all DesktopDataExtractor activities before this branch*
    Search for missing icons for `x` in any package `x`-(common|data)
    Search for missing icons for `x` in common icon themes
    Search for missing icons for `x` in `x`'s direct, non-lib dependencies
    *calcArchSpecific and addCodecInformation*

More patterns can be added to the file "data/icon_search.cfg" if needed.
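
For illustration, the search order described above can be sketched
roughly as follows. This is a simplified, standalone approximation and
not the code in the branch: `candidate_packages` and its parameters are
hypothetical names, dependencies are passed in as plain strings rather
than read from the apt cache, and the first term is escaped before
substitution, which the branch itself does not do.

```python
import re
from collections import OrderedDict


def candidate_packages(pkgname, patterns, seen_packages, dependencies):
    """Return package names to search for icons, most likely first.

    Hypothetical helper: `patterns` are lines from icon_search.cfg,
    `seen_packages` are names of packages found during the crawl and
    `dependencies` are the direct dependency names of `pkgname`.
    """
    # an OrderedDict is used as an ordered set (keys only)
    candidates = OrderedDict()
    # "foo-bar-baz" -> "foo"
    first_term = re.split("-|_", pkgname, maxsplit=1)[0]
    for pattern in patterns:
        # substitute "{0}" with the first term of the package name
        # (assumption: escaped here so the name is matched literally)
        regex = re.compile(pattern.replace("{0}", re.escape(first_term)))
        for name in seen_packages:
            if regex.match(name):
                candidates[name] = None
    # finally queue all direct dependencies that are not libraries
    for dep in dependencies:
        if not dep.startswith("lib"):
            candidates[dep] = None
    return list(candidates)
```

With the patterns shipped in this branch, a package such as `audacity`
would have `audacity-data` searched first, then any icon theme seen in
the crawl, then its remaining non-lib dependencies.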

TESTING

I've tested this on main[a-j] and universe[2-d] (to avoid downloading
all of archive.ubuntu.com). A few icon fetches failed, but there were no
regressions, and every failure is explained by one of the following:

  (a) broken packages
  (b) icons from package dependencies outside the pool I tested
  (c) symbolic links (tarfile doesn't work well with them)

This should be a safe merge: the bulk of my added code executes after
what the script already does is finished.

FUTURE CONSIDERATIONS

* Symlinked icons still fail. This is affecting some GNUstep apps.
* A blacklist of frequent, non-lib dependencies that never have icons
* A hardcoded list for edge-case apps/long, fruitless icon searches
* More regular expressions to filter likely icon packages
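
As a possible starting point for the symlink case, the path a symlinked
icon points to could be computed from the tar member and then looked up
in the candidate packages. This is only a sketch under that assumption:
`resolve_symlink_target` is a hypothetical helper that is not part of
this branch, and it does not open or extract anything itself.

```python
import posixpath
import tarfile


def resolve_symlink_target(member):
    """Return the archive path a symlink member points to.

    Hypothetical helper: `member` is a tarfile.TarInfo. Returns None
    when the member is not a symbolic link. The result is normalized,
    with no leading "./" or "/", so it may need the prefix used by the
    archive's member names added back before a lookup.
    """
    if not member.issym():
        return None
    # relative targets are resolved against the directory of the link;
    # posixpath.join() discards `base` when the target is absolute
    base = posixpath.dirname(member.name)
    target = posixpath.normpath(posixpath.join(base, member.linkname))
    return target.lstrip("/")
```

The target may live in a different package than the link, so the
returned path would still have to be searched for in the other debs.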

Michael Vogt (mvo) wrote :

Thanks a bunch! This looks excellent. I am currently running it on a full mirror and I can't wait to see what the results are.

review: Approve
Michael Vogt (mvo) wrote :

I added a small test (based on cheese/cheese-common) and a logging addition to print out the missing icons after the post-processing finder has done its work. It will be interesting to see what is left and whether we could use something like apt-file to search for the remaining icons.

Btw, I noticed another branch for the crawler. Is that still work-in-progress, or should I take a look as well? :)

Jjed (jjed) wrote :

Thanks for the quick merge, hope it works out well.

As for my other branch, I have better-icons mostly finished locally, testing still pending. The branch will default to preferring higher quality icons and scaling them down to a reasonable size (for example, `arista` goes from a 1.8M svg to a 3K png). In using it so far, icons are significantly higher quality with smaller average disk size: pretty nice.

The big question is whether you're willing to add an optional `rsvg` dependency to the script. The preincluded python batteries (PIL) don't work with scalable graphics.

Preview Diff

=== modified file 'ArchiveCrawler/__init__.py'
--- ArchiveCrawler/__init__.py 2010-09-08 08:26:06 +0000
+++ ArchiveCrawler/__init__.py 2011-04-11 13:39:24 +0000
@@ -47,6 +47,7 @@
         self.debfiles_done_database = debfiles_done_database
         self._loadDebFilesDone()
         self.callbacks = set()
+        self.pkgs_to_debpath = {}

     def registerCallback(self, c):
         if not callable(c):
@@ -103,10 +104,6 @@
         if pkgarch != "all" and pkgarch != self.arch:
             logging.debug("skipping, wrong arch: '%s' (expected '%s')" % (pkgarch, self.arch))
             return False
-        # ... done already
-        if debfile in self.debfiles_done:
-            logging.debug("skipping, already in debfiles_done '%s'" % debfile)
-            return False
         # ... by name
         if not self.cache.has_key(pkgname):
             logging.debug("skipping, not in cache: '%s'" % pkgname)
@@ -132,6 +129,13 @@
         if not debfile.startswith("%s/%s" % (self.pooldir, component)):
             logging.debug("skipping, compoent does not match (expected '%s' got '%s' "% (component, debfile))
             return False
+        # add to mapping of name and deb
+        # it may be needed if it contains an application's icon
+        self.pkgs_to_debpath[pkgname] = debfile
+        # ... then filter if done already
+        if debfile in self.debfiles_done:
+            logging.debug("skipping, already in debfiles_done '%s'" % debfile)
+            return False

         # looks like we have a valid ver
         logging.debug("found valid deb: '%s'" % debfile)

=== modified file 'DesktopDataExtractor/__init__.py'
--- DesktopDataExtractor/__init__.py 2011-03-17 15:53:26 +0000
+++ DesktopDataExtractor/__init__.py 2011-04-11 13:39:24 +0000
@@ -5,6 +5,7 @@
 import apt
 import apt_pkg
 import apt_inst
+import collections
 import os.path
 import re
 import tempfile
@@ -55,6 +56,12 @@
         # available in certain arches
         self.pkgs_per_arch = {}
         self.pkgs_per_arch["all"] = set()
+        # a mapping of package names to wanted application icons their
+        # packages don't contain
+        self.pkgs_to_missing_icons = {}
+        # regular expressions for finding packages that might contain
+        # wanted icons
+        self.iconsearch_regex = []
         # now read the config
         self._readConfig()

@@ -76,6 +83,7 @@
         blacklist_desktop = os.path.join(self.datadir,"blacklist_desktop.cfg")
         renamecfg = os.path.join(self.datadir,"rename.cfg")
         annotatecfg = os.path.join(self.datadir,"annotate.cfg")
+        iconsearchcfg = os.path.join(self.datadir,"icon_search.cfg")
         if os.path.exists(blacklist):
             logging.info("using blacklist: '%s'" % blacklist)
             for line in open(blacklist).readlines():
@@ -110,6 +118,13 @@
                     annotations = annotations_str.split(",")
                     logging.debug("annotations: '%s': %s" % (desktopfile,annotations))
                     self.desktop_annotate[desktopfile] = annotations
+        if os.path.exists(iconsearchcfg):
+            logging.info("using icon search: '%s'" % iconsearchcfg)
+            for line in open(iconsearchcfg):
+                line = line.strip()
+                if line != "" and not line.startswith("#"):
+                    logging.debug("icon search regex: '%s'" % line)
+                    self.iconsearch_regex.append(line)



@@ -134,11 +149,59 @@
         self.crawler.updateCache()
         self.crawler.registerCallback(self.inspectDeb)
         self.crawler.crawl()
+        self._findMissingIcons()
         self._calcArchSpecific()
         self._addCodecInformation()
         pickle.dump(self.deb_to_files,open(self.deb_to_files_f,"w"))
         logging.info("extract() finished")
-
+
+    def _findMissingIcons(self):
+        """ search for missing desktop icons in using the crawl cache """
+        for (pkgname, icons) in self.pkgs_to_missing_icons.items():
+            logging.debug("Searching for missing '%s' icons" % pkgname)
+            # get an ordered set from most likely to least likely package
+            to_search = collections.OrderedDict()
+            # add (in order of importance) all cached packages matching regex
+            for regex in self.iconsearch_regex:
+                if '{0}' in regex:
+                    first_term = re.split('-|_', pkgname, 1)[0]
+                    regex = regex.format(first_term)
+                matches = filter(re.compile(regex).match, self.pkgs_seen)
+                for match in matches:
+                    to_search[match] = None
+            # queue all non-library dependencies of the package
+            deps = self.crawler.cache[pkgname].candidate.dependencies
+            for dep in deps:
+                for dep_candidate in dep.or_dependencies:
+                    depname = dep_candidate.name
+                    if not depname.startswith('lib'):
+                        to_search[depname] = None
+
+            # finally, search the set of likely packages
+            for name in to_search:
+                # get cached tarfile
+                logging.debug("* Looking in %s" % name)
+                if name not in self.crawler.pkgs_to_debpath:
+                    logging.debug(" Deb for %s not found!" % name)
+                    continue
+                try:
+                    debPath = self.crawler.pkgs_to_debpath[name]
+                    datafile = self._extractDebData(debPath)
+                    tar = tarfile.open(datafile)
+                except:
+                    logging.debug(" Deb for %s could not be opened!" % name)
+                    continue
+                found = set()
+                for icon in icons:
+                    (res, n) = self.search_icon(tar, icon, self.menu_data)
+                    if res == True:
+                        logging.debug(" Icon %s found!" % icon)
+                        found.add(icon)
+                # stop searching for any icons we find
+                icons = icons.difference(found)
+                if len(icons) == 0:
+                    break
+
     def _calcArchSpecific(self):
         # now add the architecture information
         arch_specific = set()
@@ -258,7 +321,11 @@
         # but "zapping/zapping.png"
         if "/" in iconName:
             newIconName = iconName.replace("/", "_")
-            res = self.extract_icon(tarfile, iconName, os.path.join(outputdir,"icons",newIconName))
+            outpath = os.path.join(outputdir,"icons",newIconName)
+            # prevent wasted disk read
+            if os.path.exists(outpath):
+                return (True, newIconName)
+            res = self.extract_icon(tarfile, iconName, outpath)
             return (res, newIconName)

         # this is the "get-it-from-a-icontheme" case, look into icon-theme hicolor and usr/share/pixmaps
@@ -277,6 +344,9 @@
         # extensions (ordered by importance)
         pixmaps_ext = ["", ".png",".xpm",".svg"]

+        # prevent wasted disk read
+        if os.path.exists(os.path.join(outputdir,"icons",iconName)):
+            return (True, None)
         for d in search_dirs:
             for name in tarfile.getnames():
                 if d in name:
@@ -375,6 +445,10 @@
                     iconName = line[line.index("=")+1:]
                     logging.debug("Package '%s' needs icon '%s'" % (pkgname, iconName))
                     (res, newIconName) = self.search_icon(dataFile, iconName, outputdir)
+                    if res == False:
+                        if not pkgname in self.pkgs_to_missing_icons:
+                            self.pkgs_to_missing_icons[pkgname] = set()
+                        self.pkgs_to_missing_icons[pkgname].add(iconName)

             # now check for supicious pkgnames (FIXME: make this not hardcoded)
             if "-common" in pkgname or "-data" in pkgname:
@@ -438,6 +512,12 @@
         """ extract the desktop file and the icons from a deb """
         outputdir=self.menu_data
         logging.debug("processing: %s" % debPath)
+        datafile = self._extractDebData(debPath)
+        if datafile:
+            self.getFiles(datafile, pkgname, section, debPath)
+            os.remove(datafile)
+
+    def _extractDebData(self, debPath):
         datafile = self._getMemberFromAr(debPath, "data.tar.gz")
         if datafile == None:
             datafile = self._getMemberFromAr(debPath, "data.tar.bz2")
@@ -450,8 +530,7 @@
             # extract it here, python tarfile does not support lzma
             subprocess.call(["lzma","-d",datafile])
             datafile = os.path.splitext(datafile)[0]
-        self.getFiles(datafile, pkgname, section, debPath)
-        os.remove(datafile)
+        return datafile

     def inspectDeb(self, crawler, filename, pkgname, ver, pkgarch, component):
         """ check if the deb is interessting for us (not blacklisted) """

=== added file 'data/icon_search.cfg'
--- data/icon_search.cfg 1970-01-01 00:00:00 +0000
+++ data/icon_search.cfg 2011-04-11 13:39:24 +0000
@@ -0,0 +1,11 @@
+# these are regular expressions for finding packages with icons for a
+# *.desktop file, when icons are not in the application package. Any
+# string "{0}" will be formatted to the first hyphen-delimited term of
+# an application package (eg "foo-bar-baz" -> "foo")
+
+# for cases like wesnoth/wesnoth-data
+^{0}.+(data|common)$
+
+# for when environment applications use environmental icons
+gnome-icon-theme
+oxygen-icon-theme

=== modified file 'getMenuData.py'
--- getMenuData.py 2008-06-30 12:00:04 +0000
+++ getMenuData.py 2011-04-11 13:39:24 +0000
@@ -8,9 +8,7 @@
 # FIXME: strip "TryExec" from the extracted menu files (and noDisplay)
 #
 # TODO:
-# - emacs21 ships it's icon in emacs-data, deal with this
 # - some stuff needs to be blacklisted (e.g. gnome-about)
-# - lots of packages have there desktop file in "-data", "-comon" (e.g. anjuta)
 # - lots of packages have multiple desktop files for the same application
 # abiword, abiword-gnome, abiword-gtk
 #
