Merge lp:~michael.nelson/launchpad/432152-orig-file-counts into lp:launchpad

Proposed by Michael Nelson
Status: Merged
Approved by: Gavin Panella
Approved revision: no longer in the source branch.
Merged at revision: not available
Proposed branch: lp:~michael.nelson/launchpad/432152-orig-file-counts
Merge into: lp:launchpad
Diff against target: 95 lines
2 files modified
lib/lp/soyuz/model/archive.py (+19/-10)
lib/lp/soyuz/tests/test_archive.py (+37/-0)
To merge this branch: bzr merge lp:~michael.nelson/launchpad/432152-orig-file-counts
Reviewer Review Type Date Requested Status
Gavin Panella (community) Approve
Review via email: mp+12645@code.launchpad.net
To post a comment you must log in.
Revision history for this message
Michael Nelson (michael.nelson) wrote :

= Summary =

This is a fix for bug 432152, where tarball uploads to a PPA for
different distroseries was resulting in the tarball size being counted
once for each upload - even though we (can) only store one copy of the
tarball in the pool (for eg. the tarball
freeswitch_1.0.4+repack3.orig.tar.gz at
http://ppa.launchpad.net/pbxbuntu-drivers/ppa/ubuntu/pool/main/f/freeswitch/
is used for all the distroseries builds.)

== Proposed fix ==

== Pre-implementation notes ==

I had a pre-imp. with Celso who recommended that we not put the
responsibility of checking for previous uploads of the same librarian
file in the DSCFile, but rather, just updating the sources_size property
on the archive, to ensure that it only counts unique filenames in the pool.

== Implementation details ==

== Tests ==

bin/test -vv -t TestArchiveRepositorySize -t doc/archive.txt -t
archive-views.txt

== Demo and Q/A ==

No demo. Q/A we can do on dogfood, or check on production when William
uploads his next update.

= Launchpad lint =

Checking for conflicts. and issues in doctests and templates.
Running jslint, xmllint, pyflakes, and pylint.
Using normal rules.

Linting changed files:
  lib/lp/soyuz/tests/test_archive.py
  lib/lp/soyuz/model/archive.py

== Pylint notices ==

lib/lp/soyuz/model/archive.py
    14: [F0401] Unable to import 'lazr.lifecycle.event' (No module named
lifecycle)

--
Michael

Revision history for this message
Gavin Panella (allenap) wrote :

As you suggested on IRC, it's a good idea to use the sha1 in the query, if only to make it clear that the query will DTRT, and so that future maintainers of this code don't wonder what's happening. It seems that a good understanding of the data model is required to appreciate why the query is correct as it stands.

One other thing. I think you can safely avoid the extra query for the result.count() by using result.values(...):

{{{
--- lib/lp/soyuz/model/archive.py 2009-09-30 09:10:16 +0000
+++ lib/lp/soyuz/model/archive.py 2009-09-30 09:52:30 +0000
@@ -468,10 +468,7 @@
         result = result.config(distinct=True)

         # Storm's result.sum() throws errors when the result is empty.
- if result.count() == 0:
- return 0
- else:
- return result.sum(LibraryFileContent.filesize)
+ return sum(result.values(LibraryFileContent.filesize))

     def _getBinaryPublishingBaseClauses (
         self, name=None, version=None, status=None, distroarchseries=None,
}}}

I doubt this will be measurably slower than using result.sum(), but it should be quicker than running the query twice. Also, do you know if there's a bug filed for this issue in Storm, or is it reasonable behaviour?

Thanks, Gavin.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'lib/lp/soyuz/model/archive.py'
2--- lib/lp/soyuz/model/archive.py 2009-09-23 07:41:13 +0000
3+++ lib/lp/soyuz/model/archive.py 2009-09-30 15:06:17 +0000
4@@ -444,8 +444,10 @@
5 def sources_size(self):
6 """See `IArchive`."""
7 store = getUtility(IStoreSelector).get(MAIN_STORE, DEFAULT_FLAVOR)
8- result = store.find(
9- (LibraryFileContent),
10+ result = store.find((
11+ LibraryFileAlias.filename,
12+ LibraryFileContent.sha1,
13+ LibraryFileContent.filesize,),
14 SourcePackagePublishingHistory.archive == self.id,
15 SourcePackagePublishingHistory.dateremoved == None,
16 SourcePackagePublishingHistory.sourcepackagereleaseID ==
17@@ -453,15 +455,22 @@
18 SourcePackageReleaseFile.libraryfileID == LibraryFileAlias.id,
19 LibraryFileAlias.contentID == LibraryFileContent.id)
20
21- # We need to select distinct `LibraryFileContent`s because that how
22- # they end up published in the archive disk. Duplications may happen
23- # because of the publishing records join, the same `LibraryFileAlias`
24- # gets logically re-published in several locations and the fact that
25- # the same `LibraryFileContent` can be shared by multiple
26- # `LibraryFileAlias.` (librarian-gc).
27+ # Note: we can't use the LFC.sha1 instead of LFA.filename above
28+ # because some archives publish the same file content with different
29+ # names in the archive, so although the duplication will be removed
30+ # in the librarian by the librarian-gc, we do not (yet) remove
31+ # this duplication in the pool when the filenames are different.
32+
33+ # We need to select distinct on the (filename, filesize) result
34+ # so that we only count duplicate files (with the same filename)
35+ # once (ie. the same tarball used for different distroseries) as
36+ # we do avoid this duplication in the pool when the names are
37+ # the same.
38 result = result.config(distinct=True)
39- size = sum([lfc.filesize for lfc in result])
40- return size
41+
42+ # Using result.sum(LibraryFileContent.filesize) throws errors when
43+ # the result is empty, so instead:
44+ return sum(result.values(LibraryFileContent.filesize))
45
46 def _getBinaryPublishingBaseClauses (
47 self, name=None, version=None, status=None, distroarchseries=None,
48
49=== modified file 'lib/lp/soyuz/tests/test_archive.py'
50--- lib/lp/soyuz/tests/test_archive.py 2009-09-17 07:12:09 +0000
51+++ lib/lp/soyuz/tests/test_archive.py 2009-09-30 15:06:17 +0000
52@@ -168,6 +168,43 @@
53 previous_size + 1,
54 self.publisher.ubuntutest.main_archive.binaries_size)
55
56+ def test_sources_size_on_empty_archive(self):
57+ # Zero is returned for an archive without sources.
58+ self.assertEquals(
59+ 0, self.ppa.sources_size,
60+ 'Zero should be returned for an archive without sources.')
61+
62+ def test_sources_size_does_not_count_duplicated_files(self):
63+ # If there are multiple copies of the same file name/size
64+ # only one will be counted.
65+ pub_1 = self.publisher.getPubSource(
66+ filecontent='22', version='0.5.11~ppa1', archive=self.ppa)
67+
68+ pub_2 = self.publisher.getPubSource(
69+ filecontent='333', version='0.5.11~ppa2', archive=self.ppa)
70+
71+ self.assertEquals(5, self.ppa.sources_size)
72+
73+ shared_tarball = self.publisher.addMockFile(
74+ filename='foo_0.5.11.tar.gz', filecontent='1')
75+
76+ # After adding a the shared tarball to the ppa1 version,
77+ # the sources_size updates to reflect the change.
78+ pub_1.sourcepackagerelease.addFile(shared_tarball)
79+ self.assertEquals(
80+ 6, self.ppa.sources_size,
81+ 'The sources_size should update after a file is added.')
82+
83+ # But after adding a copy of the shared tarball to the ppa2 version,
84+ # the sources_size is unchanged.
85+ shared_tarball_copy = self.publisher.addMockFile(
86+ filename='foo_0.5.11.tar.gz', filecontent='1')
87+
88+ pub_2.sourcepackagerelease.addFile(shared_tarball_copy)
89+ self.assertEquals(
90+ 6, self.ppa.sources_size,
91+ 'The sources_size should change after adding a duplicate file.')
92+
93
94 class TestSeriesWithSources(TestCaseWithFactory):
95 """Create some sources in different series."""