Merge lp:~mvo/pkgme-devportal/less-silly-contents-file-based-db-backend into lp:~mvo/pkgme-devportal/run-specific-tests

Proposed by Michael Vogt
Status: Superseded
Proposed branch: lp:~mvo/pkgme-devportal/less-silly-contents-file-based-db-backend
Merge into: lp:~mvo/pkgme-devportal/run-specific-tests
Diff against target: 386 lines (+271/-11)
5 files modified
README (+17/-8)
devportalbinary/configuration.py (+5/-3)
devportalbinary/database.py (+159/-0)
devportalbinary/tests/data/apt-file-backend/Contents-oneiric-i386 (+47/-0)
devportalbinary/tests/test_database.py (+43/-0)
To merge this branch: bzr merge lp:~mvo/pkgme-devportal/less-silly-contents-file-based-db-backend
Reviewer Review Type Date Requested Status
Jonathan Lange (community) Approve
Michael Vogt Pending
Review via email: mp+119874@code.launchpad.net

This proposal has been superseded by a proposal from 2012-08-22.

Description of the change

This branch makes the apt-file/Conents-$(arch) backend more reasonable
and most importantly fast. With that we could use it to run the scoreboard
once with the regular DB backend and once with the aptfile backend.

To post a comment you must log in.
Revision history for this message
Jonathan Lange (jml) wrote :

Thanks Michael.

The code is very clear and well-structured. I'd draw the lines in slightly different places, but I don't think they are necessary for this to land, or even necessary at all. "Approve with tweaks"

Minor typos:

 * iter_contents_file has a typo in its header, "wiht" vs "with"

 * _download_and_prepare_contents_file_if_needed has a typo in its docstring, "Conents" vs "Contents"

 * "interessted" vs "interested" in comments in same function

Some other suggestions:

 * _get_lib_to_pkgs_mapping
   * 'cachekey' seems unnecessary. Why not use a tuple as the key?
   * Python *almost* has something that does this for you (collections.defaultdict). How annoying.

In case you're curious, here's what I think I'd do (or at least would muck around with):

 * iter_contents_file
   * return a set of pkgs, rather than a comma-separated string
   * muck around with having it return a dict, or wrap it in something that returns a dict
   * split out the filtering code. These three together would make it a proper parser.

 * _get_mapping_from_contents_file
   * after those changes to iter_contents_file, this can be expressed in terms of 'filter' and 'reduce' (well, if only Python had dict versions of basic HOFs)

 * _download_and_prepare_contents_file_if_needed
   * extract the bit inside the 'if' into a function (or method) that downloaded & prepared the file
   * make a new method that tried to open the contents file, and then called the download & prepare method if it didn't exist (if e.errno == ENOENT)
   * change _get_mapping_from_contents_file to use that new 'open' method

Basically, I have a prejudice against look-before-you-leap and another prejudice toward functional programming.

cheers,
jml

review: Approve
81. By Michael Vogt

fix typos (thanks to jml)

82. By Michael Vogt

devportalbinary/database.py: use (distroseries, arch) as cachekey instead of string, thanks to jml

Revision history for this message
Michael Vogt (mvo) wrote :

On Thu, Aug 16, 2012 at 12:01:31PM -0000, Jonathan Lange wrote:
> Review: Approve
>
> Thanks Michael.
>
> The code is very clear and well-structured.

Happy to hear that!

> I'd draw the lines in slightly different places, but I don't think they are necessary for this to land, or even necessary at all. "Approve with tweaks"

Thank you for your excellent review (as always :)

> Minor typos:
>
> * iter_contents_file has a typo in its header, "wiht" vs "with"
>
> * _download_and_prepare_contents_file_if_needed has a typo in its docstring, "Conents" vs "Contents"
>
> * "interessted" vs "interested" in comments in same function

Fixed in bzr now.

> Some other suggestions:
>
> * _get_lib_to_pkgs_mapping
> * 'cachekey' seems unnecessary. Why not use a tuple as the key?

Indeed, fixed too.

> * Python *almost* has something that does this for you
> (collections.defaultdict). How annoying.

:)
>
 In case you're curious, here's what I think I'd do (or at least would muck around with):
[..]
> Basically, I have a prejudice against look-before-you-leap and another prejudice toward functional programming.

I like your suggestions and will play around with them. I'm not sure I
will do all of them, but I want to apply them and see what the code looks
like then :) Unfortunately I did not get to this yet :/ But I will try
again early next week or after feature freeze (latest).

Thanks,
 Michael

83. By Michael Vogt

iter_contents_file yields a set of pkgnames now instead of a comma seperated string

84. By Michael Vogt

use a filter func for iter_contents_file() instead of the contents path hardcoding

85. By Michael Vogt

use the python buildin filter() instead of passing a filter func to iter_contents_file()

Revision history for this message
Michael Vogt (mvo) wrote :

I addressed some more points, its not doing all you suggested, but a reeval would be nice.

Revision history for this message
Jonathan Lange (jml) wrote :

Thanks! I hope you had fun with this.

jml

review: Approve
Revision history for this message
Jonathan Lange (jml) wrote :

I can't approved this for landing because it is proposed into your 'run-specific-tests'. I'm pretty sure tarmac won't pick it up either.

Could you please re-propose against trunk & I'll rubberstamp.

86. By Michael Vogt

fix configuration tests as the new default is the AptFilePackageDatabase backend

87. By Michael Vogt

devportalbinary/tests/test_database.py: use self.assertIsInstance() instead of trying to do the same manually, thanks to jml

Unmerged revisions

87. By Michael Vogt

devportalbinary/tests/test_database.py: use self.assertIsInstance() instead of trying to do the same manually, thanks to jml

86. By Michael Vogt

fix configuration tests as the new default is the AptFilePackageDatabase backend

85. By Michael Vogt

use the python buildin filter() instead of passing a filter func to iter_contents_file()

84. By Michael Vogt

use a filter func for iter_contents_file() instead of the contents path hardcoding

83. By Michael Vogt

iter_contents_file yields a set of pkgnames now instead of a comma seperated string

82. By Michael Vogt

devportalbinary/database.py: use (distroseries, arch) as cachekey instead of string, thanks to jml

81. By Michael Vogt

fix typos (thanks to jml)

80. By Michael Vogt

devportalbinary/tests/test_database.py: also test that headers are skipped

79. By Michael Vogt

add test docstrings

78. By Michael Vogt

another round of refactor/docstring cleanup

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'README'
2--- README 2012-08-15 14:24:43 +0000
3+++ README 2012-08-21 16:41:20 +0000
4@@ -39,7 +39,6 @@
5 You can override the location of this configuration file by setting the
6 environment variable ``PKGME_DEVPORTAL_CONFIG_FILE``.
7
8-
9 Postgres
10 --------
11
12@@ -59,15 +58,25 @@
13 sudo -u postgres psql -f db/postgres_schema.sql
14
15
16+Contents File (apt-file)
17+------------------------
18+Running in this mode is useful for running locally. It will download the
19+archives "Contents-%(architecture)s" file on demand and build a local
20+version of that to map from soname to pkgname.
21+
22+You can run pkgme locally using:
23+ $ cd /path/to/binary/thing/you/want/packaged
24+ $ echo '{ "package_name" : "mypkg", "description" : "some description" }' > devportal-metadata.json
25+ $ PYTHONPATH=/path/to/pkgme:/path/to/pkgme-devport \
26+ PKGME_BACKEND_PATHS=/path/to/pkgme-devportal/devportalbinary/backends/ \
27+ pkgme
28+
29+
30 Acceptance tests
31 ================
32
33 There is an acceptance test suite that is not run as part of the
34-unit tests. You can run these tests in virtualenv with
35-
36- python -m testtools.run acceptance.tests
37-
38-or in the rootdir of the checkout with
39-
40- PYTHONPATH=$(pwd) python -m testtools.run acceptance.tests
41+unit tests. You can run these tests with:
42+
43+ $ make acceptance
44
45
46=== modified file 'devportalbinary/configuration.py'
47--- devportalbinary/configuration.py 2012-07-27 13:16:09 +0000
48+++ devportalbinary/configuration.py 2012-08-21 16:41:20 +0000
49@@ -31,9 +31,9 @@
50
51 # database
52 database = Section()
53- database.db_type = StringOption(default='postgres',
54- help=('The database to use, "postgres" is the only supported '
55- 'value.'))
56+ database.db_type = StringOption(default='aptfile',
57+ help=('The database to use, "postgres", "aptfile" are supported '
58+ 'values.'))
59 database.host = StringOption(default=None,
60 help='The database host (for postgres)')
61 database.port = IntOption(default=None,
62@@ -44,6 +44,8 @@
63 help='The database password (for postgres)')
64 database.db_name = StringOption(default=None,
65 help='The database name (for postgres)')
66+ database.aptfile_cachedir = StringOption(default="~/.cache/pkgme-devportal",
67+ help='The cache directory for the aptfile backend')
68
69 scan_mode = StringOption(
70 help='To scan binary or source packages.')
71
72=== modified file 'devportalbinary/database.py'
73--- devportalbinary/database.py 2012-08-09 00:04:42 +0000
74+++ devportalbinary/database.py 2012-08-21 16:41:20 +0000
75@@ -2,10 +2,12 @@
76 # GNU Affero General Public License version 3 (see the file LICENSE).
77
78 import errno
79+import gzip
80 from itertools import chain
81 import os
82 import shutil
83 import tempfile
84+import urllib
85
86 from bzrlib import urlutils
87 from fixtures import (
88@@ -267,6 +269,24 @@
89 yield library, dependency
90
91
92+def iter_contents_file(contents):
93+ """ Yield (full-library-path, set-of-pkgnames) from a Contents file.
94+
95+ It expects a line starting with "FILE" that tells it when the header ends
96+ and the actual content starts.
97+ """
98+ found_start_marker = False
99+ for line in contents:
100+ if not found_start_marker:
101+ if line.startswith("FILE"):
102+ found_start_marker = True
103+ continue
104+ (path, sep, pkgs) = [s.strip() for s in line.rpartition(" ")]
105+ # pkgs is formated a bit funny, e.g. universe/pkgname
106+ pkgs = set([os.path.basename(pkg) for pkg in pkgs.split(",")])
107+ yield (path, pkgs)
108+
109+
110 class URI(StormURI):
111 """A stand-in for Storm's URI class.
112
113@@ -289,10 +309,146 @@
114 self.options = dict()
115
116
117+class AptFilePackageDatabase(object):
118+ """ Really dumb database that just uses apt-file for local testing """
119+
120+ # we could also read /etc/ld.so.conf.d/*.conf but this maybe different on
121+ # different distroseries especially if
122+ # server-distroseries != target-distroseries
123+ # (I wish there was ldconfig --print-search-dirs)
124+ LD_SEARCH_PATH = [
125+ # standards
126+ "lib",
127+ "usr/lib",
128+ "usr/local/lib",
129+ # old biarch
130+ "lib32",
131+ "usr/lib32",
132+ # new multiarch
133+ "lib/i686-linux-gnu",
134+ "lib/i386-linux-gnu",
135+ "lib/x86_64-linux-gnu",
136+ "usr/lib/i386-linux-gnu",
137+ "usr/lib/i686-linux-gnu",
138+ "usr/lib/x86_64-linux-gnu",
139+ # ?
140+ "usr/lib/x86_64-linux-gnu/fakechroot",
141+ "usr/lib/x86_64-linux-gnu/mesa",
142+ "usr/lib/x86_64-linux-gnu/mesa-egl",
143+ "usr/lib/i386-linux-gnu/mesa",
144+ ]
145+
146+ DISTROSERIES = "oneiric"
147+
148+ CONTENTS_FILE_URL_LOCATION = (
149+ "http://archive.ubuntu.com/ubuntu/dists/%(distroseries)s/"
150+ "Contents-%(arch)s.gz")
151+
152+ CONTENTS_FILE = "Contents-%(distroseries)s-%(arch)s"
153+
154+ def __init__(self, cachedir=None):
155+ self.cachedir = os.path.expanduser(cachedir)
156+ self._distroseries_arch_cache = {}
157+
158+ def _get_lib_to_pkgs_mapping(self, distroseries, arch):
159+ """ Returns a dict of { library-name : set([pkg1,pkg2])
160+
161+ This function will return a dict to lookup library-name to package
162+ dependencies for the given distroseries and architecture
163+ """
164+ if not (distroseries, arch) in self._distroseries_arch_cache:
165+ self._distroseries_arch_cache[(distroseries, arch)] = \
166+ self._get_mapping_from_contents_file(distroseries, arch)
167+ return self._distroseries_arch_cache[(distroseries, arch)]
168+
169+ def _get_contents_file_cache_path(self, distroseries, arch):
170+ """ Return the path in the cache for the given distroseries, arch """
171+ return os.path.join(
172+ self.cachedir, self.CONTENTS_FILE % {
173+ 'distroseries' : distroseries, 'arch' : arch })
174+
175+ def _get_contents_file_server_url(self, distroseries, arch):
176+ """ Return the remote server URL for the given distroseries, arch """
177+ return self.CONTENTS_FILE_URL_LOCATION % {
178+ 'distroseries' : distroseries, 'arch' : arch }
179+
180+ def _get_mapping_from_contents_file(self, distroseries, arch):
181+ """ Return lib,pkgs mapping from contents file for distroseries, arch
182+
183+ This expects the contents file to be in the cachedir already.
184+ """
185+ lib_to_pkgs = {}
186+ path = self._get_contents_file_cache_path(distroseries, arch)
187+ with open(path) as f:
188+ for (path, pkgs) in filter(
189+ lambda (p, pkgs): os.path.dirname(p) in self.LD_SEARCH_PATH,
190+ iter_contents_file(f)):
191+ basename = os.path.basename(path)
192+ if not basename in lib_to_pkgs:
193+ lib_to_pkgs[basename] = set()
194+ lib_to_pkgs[basename] |= pkgs
195+ return lib_to_pkgs
196+
197+ def _download_contents_file_compressed(self, distroseries, arch):
198+ """ Downloads the content file for distroseries, arch into target """
199+ # XXX: we may eventually want to merge the Contents files from
200+ # the -updates repository too in addition to the main archive
201+ url = self._get_contents_file_server_url(distroseries, arch)
202+ target = self._get_contents_file_cache_path(distroseries, arch)
203+ compressed_target = target + os.path.splitext(url)[1]
204+ # download
205+ urllib.urlretrieve(url, compressed_target)
206+ return compressed_target
207+
208+ def _prune_contents_gz_file(self, infile, outfile):
209+ """ Read a compressed Contents.gz and write out a pruned version.
210+
211+ This will use iter_contents_file to go over infile and write
212+ the relevant lines that are in the LD_SEARCH_PATH to outfile.
213+ """
214+ with open(outfile, "w") as outf, gzip.open(infile) as inf:
215+ # first write the header
216+ outf.write("FILE LOCATION\n")
217+ # then iter over all relevant lines and write them out
218+ for (path, pkgs) in filter(
219+ lambda (p,pkgs): os.path.dirname(p) in self.LD_SEARCH_PATH,
220+ iter_contents_file(inf)):
221+ outf.write("%s %s\n" % (path, ",".join(pkgs)))
222+
223+ def _download_and_prepare_contents_file_if_needed(self, distroseries, arch):
224+ """ Ensure there is a usable Contents file in the cachedir
225+
226+ This will download, uncompress and prune a Conents file for
227+ distroseries, arch so that get_dependencies works.
228+ """
229+ # mvo: We can (and should eventually) do etag/if-modified-since
230+ # matching here. But its not really important as long as
231+ # we package for stable distroseries as the Contents file
232+ # will not change
233+ path = self._get_contents_file_cache_path(distroseries, arch)
234+ if not os.path.exists(path):
235+ compressed_contents = self._download_contents_file_compressed(
236+ distroseries, arch)
237+ # and prune from ~300mb to 1mb uncompressed as we are only
238+ # interested in the library path parts
239+ self._prune_contents_gz_file(compressed_contents, path)
240+ os.remove(compressed_contents)
241+
242+ def get_dependencies(self, lib, arch="i386"):
243+ # do lazy downloading for now, we could also make this part
244+ # of bin/fetch-symbols I guess(?)
245+ self._download_and_prepare_contents_file_if_needed(
246+ self.DISTROSERIES, arch)
247+ lib_to_pkgs = self._get_lib_to_pkgs_mapping(self.DISTROSERIES, arch)
248+ return lib_to_pkgs.get(lib)
249+
250+
251 class PackageDatabase(object):
252
253+ # the various db backends, aptfile is a bit special
254 SQLITE = 'sqlite'
255 POSTGRES = 'postgres'
256+ APTFILE = 'aptfile'
257
258 def __init__(self, store):
259 self._store = store
260@@ -350,6 +506,9 @@
261 def create(cls, store=None):
262 if store is None:
263 options = load_configuration().options
264+ # XXX: not elegant
265+ if options.database_db_type == cls.APTFILE:
266+ return AptFilePackageDatabase(options.database_aptfile_cachedir)
267 store = cls.get_store_from_config(options)
268 return cls(store)
269
270
271=== added directory 'devportalbinary/tests/data/apt-file-backend'
272=== added file 'devportalbinary/tests/data/apt-file-backend/Contents-oneiric-i386'
273--- devportalbinary/tests/data/apt-file-backend/Contents-oneiric-i386 1970-01-01 00:00:00 +0000
274+++ devportalbinary/tests/data/apt-file-backend/Contents-oneiric-i386 2012-08-21 16:41:20 +0000
275@@ -0,0 +1,47 @@
276+This file maps each file available in the Ubuntu
277+system to the package from which it originates. It includes packages
278+from the DIST distribution for the ARCH architecture.
279+
280+You can use this list to determine which package contains a specific
281+file, or whether or not a specific file is available. The list is
282+updated weekly, each architecture on a different day.
283+
284+When a file is contained in more than one package, all packages are
285+listed. When a directory is contained in more than one package, only
286+the first is listed.
287+
288+The best way to search quickly for a file is with the Unix `grep'
289+utility, as in `grep <regular expression> CONTENTS':
290+
291+ $ grep nose Contents
292+ etc/nosendfile net/sendfile
293+ usr/X11R6/bin/noseguy x11/xscreensaver
294+ usr/X11R6/man/man1/noseguy.1x.gz x11/xscreensaver
295+ usr/doc/examples/ucbmpeg/mpeg_encode/nosearch.param graphics/ucbmpeg
296+ usr/lib/cfengine/bin/noseyparker admin/cfengine
297+
298+This list contains files in all packages, even though not all of the
299+packages are installed on an actual system at once. If you want to
300+find out which packages on an installed Debian system provide a
301+particular file, you can use `dpkg --search <filename>':
302+
303+ $ dpkg --search /usr/bin/dselect
304+ dpkg: /usr/bin/dselect
305+
306+
307+FILE LOCATION
308+bin/afio universe/utils/afio
309+bin/ash universe/shells/ash
310+bin/autopartition admin/ubiquity
311+bin/autopartition-loop admin/ubiquity
312+bin/bash shells/bash
313+lib/foo.so foo,bar
314+lib/i386-linux-gnu/libx86.so.1 libs/libx86-1
315+lib/i386-linux-gnu/libz.so.1 libs/zlib1g
316+lib/i386-linux-gnu/libz.so.1.2.7 libs/zlib1g
317+lib/i386-linux-gnu/security/pam_access.so admin/libpam-modules
318+lib/i386-linux-gnu/security/pam_afs_session.so universe/net/libpam-afs-session
319+lib/init/fstab admin/mountall
320+lib/init/rw admin/initscripts
321+lib/init/splash-functions-base admin/initscripts
322+lib/init/upstart-job admin/upstart
323
324=== modified file 'devportalbinary/tests/test_database.py'
325--- devportalbinary/tests/test_database.py 2012-07-26 17:51:15 +0000
326+++ devportalbinary/tests/test_database.py 2012-08-21 16:41:20 +0000
327@@ -1,3 +1,4 @@
328+import gzip
329 import os
330
331 from fixtures import (
332@@ -10,8 +11,10 @@
333 Equals,
334 Matcher,
335 )
336+from mock import patch
337
338 from devportalbinary.database import (
339+ AptFilePackageDatabase,
340 load_configuration,
341 PackageDatabase,
342 )
343@@ -135,3 +138,43 @@
344 self.assertEqual(expected_host, uri.host)
345 self.assertEqual(expected_password, uri.password)
346 self.assertEqual(expected_username, uri.username)
347+
348+
349+class AptFilePackageDatabaseTestCase(TestCase):
350+
351+ # point to our local contents file version that is a tad smaller
352+ CONTENTS_CACHE = os.path.join(
353+ os.path.dirname(__file__), "data", "apt-file-backend")
354+
355+ def setUp(self):
356+ super(AptFilePackageDatabaseTestCase, self).setUp()
357+ self.db = AptFilePackageDatabase(self.CONTENTS_CACHE)
358+
359+ def test_read_fixture_contents_worked(self):
360+ """ test that our fixture Contents file works as expected """
361+ # our test DB has 4 entries in the default search path
362+ self.assertEqual(
363+ len(self.db._get_lib_to_pkgs_mapping("oneiric", "i386")), 4)
364+
365+ def test_get_dependencies(self):
366+ """ Test that data from the fixture dependencies file works """
367+ self.assertEqual(
368+ self.db.get_dependencies("libz.so.1"), set(["zlib1g"]))
369+
370+ @patch("urllib.urlretrieve")
371+ def test_lazy_downloading(self, mock_urlretrieve):
372+ """ test that lazy downloading works """
373+ def _put_fixture_contents_file_in_place(url, target):
374+ with gzip.open(target, "w") as f:
375+ f.write("""
376+Some header text that is ignored
377+FILE LOCATION
378+usr/lib/libfoo.so.2 pkgfoo,pkgbar
379+""")
380+ tempdir = self.useFixture(TempDir())
381+ db = AptFilePackageDatabase(tempdir.path)
382+ mock_urlretrieve.side_effect = _put_fixture_contents_file_in_place
383+ self.assertEqual(
384+ db.get_dependencies("libfoo.so.2", arch="i386"),
385+ set(["pkgfoo", "pkgbar"]))
386+ self.assertEqual(len(db._get_lib_to_pkgs_mapping("oneiric", "i386")), 1)

Subscribers

People subscribed via source and target branches

to all changes: