Merge lp:~stevenk/launchpad/populate-bprc into lp:launchpad

Proposed by Steve Kowalik
Status: Work in progress
Proposed branch: lp:~stevenk/launchpad/populate-bprc
Merge into: lp:launchpad
Diff against target: 251 lines (+122/-4)
3 files modified
database/schema/security.cfg (+6/-0)
lib/lp/scripts/garbo.py (+71/-2)
lib/lp/scripts/tests/test_garbo.py (+45/-2)
To merge this branch: bzr merge lp:~stevenk/launchpad/populate-bprc
Reviewer: Gavin Panella (community), review status: Approve
Review via email: mp+69412@code.launchpad.net

Commit message

[r=allenap][bug=796997][incr] Add a garbo-hourly job that populates the BinaryPackageReleaseContents and BinaryPackagePath tables.

Description of the change

Following on from https://code.launchpad.net/~stevenk/launchpad/db-add-bprc/+merge/64783, this branch adds a garbo-hourly job to start population of the BinaryPackageReleaseContents and BinaryPackagePath tables.

Revision history for this message
Gavin Panella (allenap) wrote :

[1]

+    maximum_chunk_size = 1

This is not allowing TunableLoop much scope for tuning the loop!
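The tuning being referred to works roughly like this: LoopTuner grows or shrinks the chunk between iterations to hit a per-iteration time goal, and maximum_chunk_size caps the growth. A minimal sketch of that feedback loop (simplified and hypothetical; not Launchpad's actual implementation):

```python
# The feedback loop behind chunk-size tuning, heavily simplified
# (hypothetical sketch; LoopTuner's real implementation differs).
import time


def run_tuned_loop(body, is_done, goal_seconds=2.0,
                   minimum_chunk_size=1, maximum_chunk_size=1000000):
    """Call body(chunk_size) until is_done(), resizing each chunk so
    that one iteration takes roughly goal_seconds."""
    chunk_size = float(minimum_chunk_size)
    while not is_done():
        start = time.time()
        body(int(chunk_size))
        elapsed = max(time.time() - start, 1e-9)
        # Scale toward the per-iteration time goal, clamped to the
        # configured [minimum, maximum] range.
        chunk_size = min(max(chunk_size * (goal_seconds / elapsed),
                             minimum_chunk_size),
                         maximum_chunk_size)
```

With maximum_chunk_size = 1 the clamp pins every chunk to a single item, so the tuner has nothing to tune.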

[2]

+        value = getUtility(IMemcacheClient).get('populate-bprc')
+        if not value:
+            self.start_at = 0
+        else:
+            self.start_at = value

Woah! :)

There are four memcache servers in production, all configured with the
same weighting, so this code has a 3-in-4 chance that it will not
retrieve what was written at the end of the last time through
PopulateBinaryPackageReleaseContents.__call__().

Also, memcache is expected to forget things. Combined with the above,
this gives less than a 1-in-4 chance of starting the loop again from
where it was last terminated.

[3]

+    def __call__(self, chunk_size):
+        for bprid in self.getCandidateBPRs(self.start_at)[:chunk_size]:
+            bpr = BinaryPackageRelease.get(bprid)

getCandidateBPRs() returns BinaryPackageRelease, not
BinaryPackageRelease.id. There's something fishy going on in this
branch.

[4]

+    def isDone(self):
+        return self.start_at > self.finish_at
+
+    def __call__(self, chunk_size):
+        for bprid in self.getCandidateBPRs(self.start_at)[:chunk_size]:

I have a suggestion here:

    def __init__(self, ...):
        super(...)
        self.done = False

    def isDone(self):
        return self.done

    def __call__(self, chunk_size):
        bprs = list(self.getCandidateBPRs(self.start_at)[:chunk_size])
        self.done = len(bprs) < chunk_size
        for bpr in bprs:
            ...

[5]

+    def test_populate_bprc(self):
+        LaunchpadZopelessLayer.switchDbUser('testadmin')

Perhaps self.layer.switchDbUser(...)?

Also, this test seems fairly fragile, and very dependent on test
data. Soyuz is a bit special in that respect, but don't cheat too much
;) If you have any doubt can you run this past bigjools?

review: Needs Fixing
Revision history for this message
William Grant (wgrant) wrote :

> [2]
>
> +        value = getUtility(IMemcacheClient).get('populate-bprc')
> +        if not value:
> +            self.start_at = 0
> +        else:
> +            self.start_at = value
>
> Woah! :)
>
> There are four memcache servers in production, all configured with the
> same weighting, so this code has a 3-in-4 chance that it will not
> retrieve what was written at the end of the last time through
> PopulateBinaryPackageReleaseContents.__call__().

This is not correct: memcached clients use a hash of the key to determine which server is to be used.
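The server selection being described can be sketched as follows (simplified and hypothetical; real clients such as python-memcached also honour server weights and rehash around dead servers, and the hash function here is an illustration):

```python
# How a memcached client picks the server for a key: hash the key and
# take it modulo the server count, so a given key always maps to the
# same server (hypothetical simplified sketch).
import hashlib


def server_for_key(key, servers):
    """Deterministically map a key to one of the configured servers."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Because the mapping depends only on the key and the server list, every read and write of 'populate-bprc' lands on the same server no matter how many servers are configured.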

> Also, memcache is expected to forget things. Combined with the above,
> this gives less than a 1-in-4 chance of starting the loop again from
> where it was last terminated.

This is a pattern that we've used before. Production memcached is under very little memory pressure and this value is frequently updated, so it rarely forgets things like this. Even if it does forget, the job just has to redo a bit of work on the failed imports. Harmless.
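The pattern in question, sketched with a hypothetical stand-in client rather than the real IMemcacheClient: the cached value is only a resume hint, so a forgotten key restarts some work but never corrupts data, because the candidate query itself skips rows that already have contents entries.

```python
# The checkpoint-in-memcache pattern, with a stand-in client
# (hypothetical sketch; the branch goes through IMemcacheClient).

class FakeMemcache:
    """Stand-in for a memcache client whose keys may be evicted."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return True

    def flush_all(self):
        # Simulate memcached forgetting everything under pressure.
        self._data.clear()


def resume_point(client, key='populate-bprc'):
    """Resume from the saved checkpoint, or from 0 if it was lost."""
    value = client.get(key)
    return value if value else 0


def checkpoint(client, next_id, key='populate-bprc'):
    """Record where the next run should pick up."""
    client.set(key, next_id)
```

The worst-case cost of eviction is that resume_point() falls back to 0 and the loop re-scans rows it has already handled.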

Revision history for this message
Gavin Panella (allenap) wrote :

> > [2]
[...]
> This is not correct: memcached clients use a hash of the key to
> determine which server is to be used.

Ha, yes, that makes sense too :) Sorry for the misunderstanding. In
the Pre-Sprogian Era I would have remembered that.

> > Also, memcache is expected to forget things. Combined with the
> > above, this gives less than a 1-in-4 chance of starting the loop
> > again from where it was last terminated.
>
> This is a pattern that we've used before. Production memcached is
> under very little memory pressure

For now. If the memory pressure ever increases to the point where this
value is being ejected from memcached before being read back then the
work is going to be done over and over and we may not be aware of it.

> ... and this value is frequently updated, so it rarely forgets
> things like this. Even if it does forget, the job just has to redo a
> bit of work on the failed imports. Harmless.

I did look for other uses of this pattern in garbo.py, but not any
further afield, so sorry if I missed something that would have changed
my review.

I still feel that memcache is not the right place to put this, but if
it's just a case of pragmatism winning over correctness for now then
I'm okay with it.

I assume this is meant to deal with aborted tasks (w.r.t. abort_time
in LoopTuner). Is that correct?

Revision history for this message
Robert Collins (lifeless) wrote :

On Wed, Aug 3, 2011 at 8:23 PM, Gavin Panella
<email address hidden> wrote:

>> This is a pattern that we've used before. Production memcached is
>> under very little memory pressure
>
> For now. If the memory pressure ever increases to the point where this
> value is being ejected from memcached before being read back then the
> work is going to be done over and over and we may not be aware of it.

We've nearly totally disabled memcache on the appservers, because the
way we were using it was very poor. Your point is accurate, but it's
not a pragmatic risk today.

>> ... and this value is frequently updated, so it rarely forgets
>> things like this. Even if it does forget, the job just has to redo a
>> bit of work on the failed imports. Harmless.
>
> I did look for other uses of this pattern in garbo.py, but not any
> further afield, so sorry if I missed something that would have changed
> my review.

It was used for migrations in the past, which were then deleted when done ;)

> I still feel that memcache is not the right place to put this, but if
> it's just a case of pragmatism winning over correctness for now then
> I'm okay with it.
>
> I assume this is meant to deal with aborted tasks (w.r.t. abort_time
> in LoopTuner). Is that correct?

And general efficiency.

FWIW this approach has a huge +1 from me; it's a very sensible approach -
lightweight, flexible, works.

-Rob

Revision history for this message
Gavin Panella (allenap) wrote :

> FWIW this approach has a huge +1 from me; it's a very sensible approach -
> lightweight, flexible, works.

I'm not going to argue with that :)

Revision history for this message
Gavin Panella (allenap) wrote :

<StevenK> allenap: You marked
  https://code.launchpad.net/~stevenk/launchpad/populate-bprc/+merge/69412
  as Needs Fixing a week ago, and you have since been smacked down by
  wgrant and lifeless. Can you reconsider? :-)
<allenap> StevenK: There were other points in the review :)
<StevenK> allenap: Right, so I'll look at getCandidateBPRs(). I
  disagree about the loop size -- if it gets killed half-way through a
  loop, for example?
<StevenK> allenap: And I don't agree the test is dependent on test
  data. How?
<allenap> StevenK: Why do you only want to do one thing round the
  loop? I suppose it keeps transactions short, but it's meant to tune
  for that anyway, isn't it?
<StevenK> allenap: So, each iteration reads a .deb from the librarian,
  parses the contents and starts tossing rows into BPRC and BPP. If
  that loop gets interrupted, do I end up with half of the data in
  the tables and half not?
<allenap> StevenK: No? Why is that a concern? I don't think it
  interrupts in the middle of a run. And transactions.
<StevenK> allenap: My basis for this was the PSC work -- and that was
  actually unpacking source packages and dealing with a bunch of
  temporary files, so it was more critical it wasn't killed in the
  middle of a run
<StevenK> allenap: Conceding the loop tuning question -- I'll bump it
  to 20 or so
<allenap> StevenK: Okay :)
<StevenK> allenap: So, the test is dependent on sample data?
<allenap> StevenK: It looked that way, but I didn't dig much. If you
  say it's not then that's all I want to know.
<allenap> StevenK: What about 3? That doesn't actually look like it's
  going to work how it is.
<StevenK> allenap: I didn't want to include another .deb in the tree,
  so I make use of one that's already there due to archiveuploader
  tests
<allenap> StevenK: Oh, ignore me. EPARSE.
<allenap> StevenK: So, r=me now, but consider point 4.

review: Approve
Revision history for this message
Gavin Panella (allenap) wrote :

tl;dr is:

[1] Steven has agreed to change,
[2] has been discussed previously, no change needed,
[3] finds me smoking crack again,
[4] should still be considered,
[5] Steven says is fine, which is all I wanted/needed to know.

lp:~stevenk/launchpad/populate-bprc updated
13477. By Steve Kowalik

Use the BPR directly and set the maximum chunk size to 20

13478. By Steve Kowalik

Merge devel, resolving conflicts.

13479. By Steve Kowalik

Fix getCandidateBPRs.

13480. By Steve Kowalik

Switch to self.done, rather than self.finish_at

13481. By Steve Kowalik

Skip any BPRs that already have entries, fix dates, and change how done is
calculated.

13482. By Steve Kowalik

Switch from a non-working not in to a not exists subselect.

13483. By Steve Kowalik

Merge devel, resolving conflicts.

13484. By Steve Kowalik

Stop logging success, fix how we determine when the population is done, and
correct Storm's misguided attempt at working out which tables are needed in
the inner select.

13485. By Steve Kowalik

Merge devel

13486. By Steve Kowalik

Fix up DB perms a little better, and use testadmin to drop all BPFs.
The message that a BPR can't be added is now a warning.
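Revision 13482's switch from NOT IN to NOT EXISTS is worth illustrating. When a subselect can yield a NULL, `x NOT IN (subselect)` evaluates to UNKNOWN for every row and silently returns nothing, whereas NOT EXISTS is NULL-safe. A self-contained sketch with made-up table names (the branch's exact failure mode may have differed; this shows the classic pitfall):

```python
# Why "NOT IN (subselect)" can silently match nothing when the
# subselect yields a NULL, while NOT EXISTS behaves as expected.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE bpr (id INTEGER PRIMARY KEY);
    CREATE TABLE bprc (bpr_id INTEGER);
    INSERT INTO bpr (id) VALUES (1);
    INSERT INTO bpr (id) VALUES (2);
    INSERT INTO bpr (id) VALUES (3);
    INSERT INTO bprc (bpr_id) VALUES (1);
    INSERT INTO bprc (bpr_id) VALUES (NULL);
""")

# The NULL in bprc makes every "id NOT IN (...)" comparison UNKNOWN,
# so no candidate rows come back at all.
not_in = conn.execute(
    "SELECT id FROM bpr "
    "WHERE id NOT IN (SELECT bpr_id FROM bprc) ORDER BY id").fetchall()

# NOT EXISTS is NULL-safe and finds the genuinely unprocessed rows.
not_exists = conn.execute(
    "SELECT id FROM bpr WHERE NOT EXISTS "
    "(SELECT 1 FROM bprc WHERE bprc.bpr_id = bpr.id) "
    "ORDER BY id").fetchall()
```

Here not_in comes back empty while not_exists correctly returns rows 2 and 3.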

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

I'm marking this Work-in-progress as it has received significant edits after the review and it hasn't been merged yet.


Preview Diff

=== modified file 'database/schema/security.cfg'
--- database/schema/security.cfg 2011-08-19 16:03:54 +0000
+++ database/schema/security.cfg 2011-08-23 00:52:55 +0000
@@ -118,6 +118,7 @@
 public.archivesubscriber = SELECT, INSERT, UPDATE
 public.authtoken = SELECT, INSERT, UPDATE, DELETE
 public.binaryandsourcepackagenameview = SELECT
+public.binarypackagefile = SELECT, INSERT
 public.binarypackagepath = SELECT, INSERT, DELETE
 public.binarypackagepublishinghistory = SELECT
 public.binarypackagereleasecontents = SELECT, INSERT, DELETE
@@ -2153,6 +2154,10 @@
 [garbo]
 groups=script,read
 public.answercontact = SELECT, DELETE
+public.binarypackagefile = SELECT
+public.binarypackagepath = SELECT, INSERT
+public.binarypackagerelease = SELECT
+public.binarypackagereleasecontents = SELECT, INSERT
 public.branchjob = SELECT, DELETE
 public.bug = SELECT, UPDATE
 public.bugaffectsperson = SELECT
@@ -2176,6 +2181,7 @@
 public.codeimportresult = SELECT, DELETE
 public.emailaddress = SELECT, UPDATE
 public.hwsubmission = SELECT, UPDATE
+public.libraryfilealias = SELECT
 public.job = SELECT, INSERT, DELETE
 public.mailinglistsubscription = SELECT, DELETE
 public.oauthnonce = SELECT, DELETE

=== modified file 'lib/lp/scripts/garbo.py'
--- lib/lp/scripts/garbo.py 2011-08-17 10:43:26 +0000
+++ lib/lp/scripts/garbo.py 2011-08-23 00:52:55 +0000
@@ -1,4 +1,4 @@
-# Copyright 2009-2010 Canonical Ltd. This software is licensed under the
+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
 # GNU Affero General Public License version 3 (see the file LICENSE).

 """Database garbage collection."""
@@ -26,9 +26,14 @@
 from psycopg2 import IntegrityError
 import pytz
 from storm.expr import (
+    Exists,
     In,
+    Join,
+    Not,
+    Select,
     )
 from storm.locals import (
+    And,
     Max,
     Min,
     SQL,
@@ -46,7 +51,10 @@
     sqlvalues,
     )
 from canonical.launchpad.database.emailaddress import EmailAddress
-from canonical.launchpad.database.librarian import TimeLimitedToken
+from canonical.launchpad.database.librarian import (
+    LibraryFileAlias,
+    TimeLimitedToken,
+    )
 from canonical.launchpad.database.oauth import OAuthNonce
 from canonical.launchpad.database.openidconsumer import OpenIDConsumerNonce
 from canonical.launchpad.interfaces.account import AccountStatus
@@ -80,6 +88,7 @@
 from lp.registry.model.person import Person
 from lp.services.job.model.job import Job
 from lp.services.log.logger import PrefixFilter
+from lp.services.memcache.interfaces import IMemcacheClient
 from lp.services.propertycache import cachedproperty
 from lp.services.scripts.base import (
     LaunchpadCronScript,
@@ -87,6 +96,14 @@
     SilentLaunchpadScriptFailure,
     )
 from lp.services.session.model import SessionData
+from lp.soyuz.interfaces.binarypackagereleasecontents import (
+    IBinaryPackageReleaseContentsSet,
+    )
+from lp.soyuz.model.binarypackagereleasecontents import (
+    BinaryPackageReleaseContents,
+    )
+from lp.soyuz.model.binarypackagerelease import BinaryPackageRelease
+from lp.soyuz.model.files import BinaryPackageFile
 from lp.translations.interfaces.potemplate import IPOTemplateSet
 from lp.translations.model.potranslation import POTranslation
 from lp.translations.model.potmsgset import POTMsgSet
@@ -868,6 +885,57 @@
         self.done = True


+class PopulateBinaryPackageReleaseContents(TunableLoop):
+    maximum_chunk_size = 20
+
+    def __init__(self, log, abort_time=None):
+        super(PopulateBinaryPackageReleaseContents, self).__init__(
+            log, abort_time)
+        value = getUtility(IMemcacheClient).get('populate-bprc')
+        if not value:
+            self.start_at = 0
+        else:
+            self.start_at = value
+        self.done = self.getCandidateBPRs(self.start_at).is_empty()
+
+    def getCandidateBPRs(self, start_at):
+        return IMasterStore(BinaryPackageRelease).using(
+            BinaryPackageRelease,
+            Join(
+                BinaryPackageFile,
+                BinaryPackageFile.binarypackagereleaseID ==
+                    BinaryPackageRelease.id),
+            Join(
+                LibraryFileAlias,
+                And(LibraryFileAlias.id == BinaryPackageFile.libraryfileID,
+                    LibraryFileAlias.content != None
+                ))).find(
+            BinaryPackageRelease, BinaryPackageRelease.id >= start_at,
+            Not(Exists(Select(
+                BinaryPackageReleaseContents.binarypackagepath_id,
+                BinaryPackageReleaseContents.binarypackagerelease_id ==
+                    BinaryPackageRelease.id,
+                tables=[BinaryPackageReleaseContents])))
+            ).order_by(BinaryPackageRelease.id)
+
+    def isDone(self):
+        """See `TunableLoop`."""
+        return self.done
+
+    def __call__(self, chunk_size):
+        """See `TunableLoop`."""
+        bprs = self.getCandidateBPRs(self.start_at)[:chunk_size]
+        self.done = self.getCandidateBPRs(self.start_at).is_empty()
+        for bpr in bprs:
+            result = getUtility(IBinaryPackageReleaseContentsSet).add(bpr)
+            if not result:
+                self.log.warning('BPR %d unable to be added.' % bpr.id)
+            self.start_at = bpr.id + 1
+        result = getUtility(IMemcacheClient).set(
+            'populate-bprc', self.start_at)
+        transaction.commit()
+
+
 class UnusedPOTMsgSetPruner(TunableLoop):
     """Cleans up unused POTMsgSets."""

@@ -1157,6 +1225,7 @@
         UnusedSessionPruner,
         DuplicateSessionPruner,
         BugHeatUpdater,
+        PopulateBinaryPackageReleaseContents,
         ]
     experimental_tunable_loops = []

=== modified file 'lib/lp/scripts/tests/test_garbo.py'
--- lib/lp/scripts/tests/test_garbo.py 2011-08-22 15:08:03 +0000
+++ lib/lp/scripts/tests/test_garbo.py 2011-08-23 00:52:55 +0000
@@ -1,10 +1,9 @@
-# Copyright 2009-2010 Canonical Ltd. This software is licensed under the
+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
 # GNU Affero General Public License version 3 (see the file LICENSE).

 """Test the database garbage collector."""

 __metaclass__ = type
-__all__ = []

 from datetime import (
     datetime,
@@ -60,6 +59,7 @@
     LaunchpadZopelessLayer,
     ZopelessDatabaseLayer,
     )
+from lp.archiveuploader.tests import datadir
 from lp.answers.model.answercontact import AnswerContact
 from lp.bugs.model.bugmessage import BugMessage
 from lp.bugs.model.bugnotification import (
@@ -98,6 +98,10 @@
     SessionPkgData,
     )
 from lp.services.worlddata.interfaces.language import ILanguageSet
+from lp.soyuz.model.binarypackagereleasecontents import (
+    BinaryPackageReleaseContents,
+    )
+from lp.soyuz.model.files import BinaryPackageFile
 from lp.testing import (
     person_logged_in,
     TestCase,
@@ -122,6 +126,11 @@

     def test_hourly_script(self):
         """Ensure garbo-hourly.py actually runs."""
+        # Our sampledata doesn't contain anythng that
+        # PopulateBinaryPackageReleaseContents can process without errors,
+        # so it's easier to just remove every BinaryPackageFile.
+        IMasterStore(BinaryPackageFile).find(BinaryPackageFile).remove()
+        transaction.commit()
         rv, out, err = run_script(
             "cronscripts/garbo-hourly.py", ["-q"], expect_returncode=0)
         self.failIf(out.strip(), "Output to stdout: %s" % out)
@@ -367,6 +376,13 @@
         self.log.addHandler(NullHandler())
         self.log.propagate = 0

+        self.layer.switchDbUser('testadmin')
+        # Our sampledata doesn't contain anythng that
+        # PopulateBinaryPackageReleaseContents can process without errors,
+        # so it's easier to just remove every BinaryPackageFile.
+        IMasterStore(BinaryPackageFile).find(BinaryPackageFile).remove()
+        transaction.commit()
+
         # Run the garbage collectors to remove any existing garbage,
         # starting us in a known state.
         self.runDaily()
@@ -948,6 +964,33 @@

         self.assertEqual(1, count)

+    def test_populate_bprc(self):
+        LaunchpadZopelessLayer.switchDbUser('testadmin')
+        bpr = self.factory.makeBinaryPackageRelease()
+        deb = open(datadir('pmount_0.9.7-2ubuntu2_amd64.deb'), 'r')
+        lfa = self.factory.makeLibraryFileAlias(
+            filename='pmount_0.9.7-2ubuntu2_amd64.deb', content=deb.read())
+        deb.close()
+        transaction.commit()
+        bpr.addFile(lfa)
+        transaction.commit()
+        self.runHourly()
+        bprc = IMasterStore(BinaryPackageReleaseContents).find(
+            BinaryPackageReleaseContents)
+        self.assertEqual(13, bprc.count())
+        paths = map(lambda x: x.binarypackagepath.path, bprc)
+        expected_paths = [
+            'etc/pmount.allow', 'usr/bin/pumount', 'usr/bin/pmount-hal',
+            'usr/bin/pmount', 'usr/share/doc/pmount/TODO',
+            'usr/share/doc/pmount/README.Debian',
+            'usr/share/doc/pmount/AUTHORS', 'usr/share/doc/pmount/copyright',
+            'usr/share/doc/pmount/changelog.gz',
+            'usr/share/doc/pmount/changelog.Debian.gz',
+            'usr/share/man/man1/pmount-hal.1.gz',
+            'usr/share/man/man1/pmount.1.gz',
+            'usr/share/man/man1/pumount.1.gz']
+        self.assertContentEqual(expected_paths, paths)
+
     def test_UnusedPOTMsgSetPruner_removes_obsolete_message_sets(self):
         # UnusedPOTMsgSetPruner removes any POTMsgSet that are
         # participating in a POTemplate only as obsolete messages.