Merge lp:~stevenk/launchpad/populate-bprc into lp:launchpad

Proposed by Steve Kowalik
Status: Work in progress
Proposed branch: lp:~stevenk/launchpad/populate-bprc
Merge into: lp:launchpad
Diff against target: 251 lines (+122/-4)
3 files modified
database/schema/security.cfg (+6/-0)
lib/lp/scripts/garbo.py (+71/-2)
lib/lp/scripts/tests/test_garbo.py (+45/-2)
To merge this branch: bzr merge lp:~stevenk/launchpad/populate-bprc
Reviewer: Gavin Panella (community), review status: Approve
Review via email: mp+69412@code.launchpad.net

Commit message

[r=allenap][bug=796997][incr] Add a garbo-hourly job that populates the BinaryPackageReleaseContents and BinaryPackagePath tables.

Description of the change

Following on from https://code.launchpad.net/~stevenk/launchpad/db-add-bprc/+merge/64783, this branch adds a garbo-hourly job to start population of the BinaryPackageReleaseContents and BinaryPackagePath tables.

Revision history for this message
Gavin Panella (allenap) wrote :

[1]

+    maximum_chunk_size = 1

This is not allowing TunableLoop much scope for tuning the loop!
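The tuning being referred to works roughly like this: LoopTuner grows or shrinks the chunk between iterations to hit a per-iteration time goal, and maximum_chunk_size caps the growth. A minimal sketch of that feedback loop (simplified and hypothetical; not Launchpad's actual implementation):

```python
# The feedback loop behind chunk-size tuning, heavily simplified
# (hypothetical sketch; LoopTuner's real implementation differs).
import time


def run_tuned_loop(body, is_done, goal_seconds=2.0,
                   minimum_chunk_size=1, maximum_chunk_size=1000000):
    """Call body(chunk_size) until is_done(), resizing each chunk so
    that one iteration takes roughly goal_seconds."""
    chunk_size = float(minimum_chunk_size)
    while not is_done():
        start = time.time()
        body(int(chunk_size))
        elapsed = max(time.time() - start, 1e-9)
        # Scale toward the per-iteration time goal, clamped to the
        # configured [minimum, maximum] range.
        chunk_size = min(max(chunk_size * (goal_seconds / elapsed),
                             minimum_chunk_size),
                         maximum_chunk_size)
```

With maximum_chunk_size = 1 the clamp pins every chunk to a single item, so the tuner has nothing to tune.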

[2]

+        value = getUtility(IMemcacheClient).get('populate-bprc')
+        if not value:
+            self.start_at = 0
+        else:
+            self.start_at = value

Woah! :)

There are four memcache servers in production, all configured with the
same weighting, so this code has a 3-in-4 chance that it will not
retrieve what was written at the end of the last time through
PopulateBinaryPackageReleaseContents.__call__().

Also, memcache is expected to forget things. Combined with the above,
this gives less than a 1-in-4 chance of starting the loop again from
where it was last terminated.

[3]

+    def __call__(self, chunk_size):
+        for bprid in self.getCandidateBPRs(self.start_at)[:chunk_size]:
+            bpr = BinaryPackageRelease.get(bprid)

getCandidateBPRs() returns BinaryPackageRelease, not
BinaryPackageRelease.id. There's something fishy going on in this
branch.

[4]

+    def isDone(self):
+        return self.start_at > self.finish_at
+
+    def __call__(self, chunk_size):
+        for bprid in self.getCandidateBPRs(self.start_at)[:chunk_size]:

I have a suggestion here:

    def __init__(self, ...):
        super(...)
        self.done = False

    def isDone(self):
        return self.done

    def __call__(self, chunk_size):
        bprs = list(self.getCandidateBPRs(self.start_at)[:chunk_size])
        self.done = len(bprs) < chunk_size
        for bpr in bprs:
            ...

[5]

+    def test_populate_bprc(self):
+        LaunchpadZopelessLayer.switchDbUser('testadmin')

Perhaps self.layer.switchDbUser(...)?

Also, this test seems fairly fragile, and very dependent on test
data. Soyuz is a bit special in that respect, but don't cheat too much
;) If you have any doubt can you run this past bigjools?

review: Needs Fixing
Revision history for this message
William Grant (wgrant) wrote :

> [2]
>
> +        value = getUtility(IMemcacheClient).get('populate-bprc')
> +        if not value:
> +            self.start_at = 0
> +        else:
> +            self.start_at = value
>
> Woah! :)
>
> There are four memcache servers in production, all configured with the
> same weighting, so this code has a 3-in-4 chance that it will not
> retrieve what was written at the end of the last time through
> PopulateBinaryPackageReleaseContents.__call__().

This is not correct: memcached clients use a hash of the key to determine which server is to be used.
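The server selection being described can be sketched as follows (simplified and hypothetical; real clients such as python-memcached also honour server weights and rehash around dead servers, and the hash function here is an illustration):

```python
# How a memcached client picks the server for a key: hash the key and
# take it modulo the server count, so a given key always maps to the
# same server (hypothetical simplified sketch).
import hashlib


def server_for_key(key, servers):
    """Deterministically map a key to one of the configured servers."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

Because the mapping depends only on the key and the server list, every read and write of 'populate-bprc' lands on the same server no matter how many servers are configured.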

> Also, memcache is expected to forget things. Combined with the above,
> this gives less than a 1-in-4 chance of starting the loop again from
> where it was last terminated.

This is a pattern that we've used before. Production memcached is under very little memory pressure and this value is frequently updated, so it rarely forgets things like this. Even if it does forget, the job just has to redo a bit of work on the failed imports. Harmless.
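The pattern in question, sketched with a hypothetical stand-in client rather than the real IMemcacheClient: the cached value is only a resume hint, so a forgotten key restarts some work but never corrupts data, because the candidate query itself skips rows that already have contents entries.

```python
# The checkpoint-in-memcache pattern, with a stand-in client
# (hypothetical sketch; the branch goes through IMemcacheClient).

class FakeMemcache:
    """Stand-in for a memcache client whose keys may be evicted."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value
        return True

    def flush_all(self):
        # Simulate memcached forgetting everything under pressure.
        self._data.clear()


def resume_point(client, key='populate-bprc'):
    """Resume from the saved checkpoint, or from 0 if it was lost."""
    value = client.get(key)
    return value if value else 0


def checkpoint(client, next_id, key='populate-bprc'):
    """Record where the next run should pick up."""
    client.set(key, next_id)
```

The worst-case cost of eviction is that resume_point() falls back to 0 and the loop re-scans rows it has already handled.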

Revision history for this message
Gavin Panella (allenap) wrote :

> > [2]
[...]
> This is not correct: memcached clients use a hash of the key to
> determine which server is to be used.

Ha, yes, that makes sense too :) Sorry for the misunderstanding. In
the Pre-Sprogian Era I would have remembered that.

> > Also, memcache is expected to forget things. Combined with the
> > above, this gives less than a 1-in-4 chance of starting the loop
> > again from where it was last terminated.
>
> This is a pattern that we've used before. Production memcached is
> under very little memory pressure

For now. If the memory pressure ever increases to the point where this
value is being ejected from memcached before being read back then the
work is going to be done over and over and we may not be aware of it.

> ... and this value is frequently updated, so it rarely forgets
> things like this. Even if it does forget, the job just has to redo a
> bit of work on the failed imports. Harmless.

I did look for other uses of this pattern in garbo.py, but not any
further afield, so sorry if I missed something that would have changed
my review.

I still feel that memcache is not the right place to put this, but if
it's just a case of pragmatism winning over correctness for now then
I'm okay with it.

I assume this is meant to deal with aborted tasks (w.r.t. abort_time
in LoopTuner). Is that correct?

Revision history for this message
Robert Collins (lifeless) wrote :

On Wed, Aug 3, 2011 at 8:23 PM, Gavin Panella
<email address hidden> wrote:

>> This is a pattern that we've used before. Production memcached is
>> under very little memory pressure
>
> For now. If the memory pressure ever increases to the point where this
> value is being ejected from memcached before being read back then the
> work is going to be done over and over and we may not be aware of it.

We've nearly totally disabled memcache on the appservers, because the
way we were using it was very poor. Your point is accurate, but it's
not a pragmatic risk today.

>> ... and this value is frequently updated, so it rarely forgets
>> things like this. Even if it does forget, the job just has to redo a
>> bit of work on the failed imports. Harmless.
>
> I did look for other uses of this pattern in garbo.py, but not any
> further afield, so sorry if I missed something that would have changed
> my review.

It was used for migrations in the past, which were then deleted when done ;)

> I still feel that memcache is not the right place to put this, but if
> it's just a case of pragmatism winning over correctness for now then
> I'm okay with it.
>
> I assume this is meant to deal with aborted tasks (w.r.t. abort_time
> in LoopTuner). Is that correct?

And general efficiency.

FWIW this approach has a huge +1 from me; it's a very sensible approach -
lightweight, flexible, works.

-Rob

Revision history for this message
Gavin Panella (allenap) wrote :

> FWIW this approach has a huge +1 from me; it's a very sensible approach -
> lightweight, flexible, works.

I'm not going to argue with that :)

Revision history for this message
Gavin Panella (allenap) wrote :

<StevenK> allenap: You marked
  https://code.launchpad.net/~stevenk/launchpad/populate-bprc/+merge/69412
  as Needs Fixing a week ago, and you have since been smacked down by
  wgrant and lifeless. Can you reconsider? :-)
<allenap> StevenK: There were other points in the review :)
<StevenK> allenap: Right, so I'll look at getCandidateBPRs(). I
  disagree about the loop size -- if it gets killed half-way through a
  loop, for example?
<StevenK> allenap: And I don't agree the test is dependent on test
  data. How?
<allenap> StevenK: Why do you only want to do one thing round the
  loop? I suppose it keeps transactions short, but it's meant to tune
  for that anyway, isn't it?
<StevenK> allenap: So, each iteration reads a .deb from the librarian,
  parses the contents and starts tossing rows into BPRC and BPP. If
  that loop gets interrupted, do I end up with half of the data in
  the tables and half not?
<allenap> StevenK: No? Why is that a concern? I don't think it
  interrupts in the middle of a run. And transactions.
<StevenK> allenap: My basis for this was the PSC work -- and that was
  actually unpacking source packages and dealing with a bunch of
  temporary files, so it was more critical it wasn't killed in the
  middle of a run
<StevenK> allenap: Conceding the loop tuning question -- I'll bump it
  to 20 or so
<allenap> StevenK: Okay :)
<StevenK> allenap: So, the test is dependent on sample data?
<allenap> StevenK: It looked that way, but I didn't dig much. If you
  say it's not then that's all I want to know.
<allenap> StevenK: What about 3? That doesn't actually look like it's
  going to work how it is.
<StevenK> allenap: I didn't want to include another .deb in the tree,
  so I make use of one that's already there due to archiveuploader
  tests
<allenap> StevenK: Oh, ignore me. EPARSE.
<allenap> StevenK: So, r=me now, but consider point 4.

review: Approve
Revision history for this message
Gavin Panella (allenap) wrote :

tl;dr is:

[1] Steven has agreed to change,
[2] has been discussed previously, no change needed,
[3] finds me smoking crack again,
[4] should still be considered,
[5] Steven says is fine, which is all I wanted/needed to know.

lp:~stevenk/launchpad/populate-bprc updated
13477. By Steve Kowalik

Use the BPR directly and set the maximum chunk size to 20

13478. By Steve Kowalik

Merge devel, resolving conflicts.

13479. By Steve Kowalik

Fix getCandidateBPRs.

13480. By Steve Kowalik

Switch to self.done, rather than self.finish_at

13481. By Steve Kowalik

Skip any BPRs that already have entries, fix dates, and change how done is
calculated.

13482. By Steve Kowalik

Switch from a non-working not in to a not exists subselect.

13483. By Steve Kowalik

Merge devel, resolving conflicts.

13484. By Steve Kowalik

Stop logging success, fix how we determine when the population is done, and
correct Storm's misguided attempt at working out which tables are needed in
the inner select.

13485. By Steve Kowalik

Merge devel

13486. By Steve Kowalik

Fix up DB perms a little better, and use testadmin to drop all BPFs.
The message that a BPR can't be added is now a warning.
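Revision 13482's switch from NOT IN to NOT EXISTS is worth illustrating. When a subselect can yield a NULL, `x NOT IN (subselect)` evaluates to UNKNOWN for every row and silently returns nothing, whereas NOT EXISTS is NULL-safe. A self-contained sketch with made-up table names (the branch's exact failure mode may have differed; this shows the classic pitfall):

```python
# Why "NOT IN (subselect)" can silently match nothing when the
# subselect yields a NULL, while NOT EXISTS behaves as expected.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE bpr (id INTEGER PRIMARY KEY);
    CREATE TABLE bprc (bpr_id INTEGER);
    INSERT INTO bpr (id) VALUES (1);
    INSERT INTO bpr (id) VALUES (2);
    INSERT INTO bpr (id) VALUES (3);
    INSERT INTO bprc (bpr_id) VALUES (1);
    INSERT INTO bprc (bpr_id) VALUES (NULL);
""")

# The NULL in bprc makes every "id NOT IN (...)" comparison UNKNOWN,
# so no candidate rows come back at all.
not_in = conn.execute(
    "SELECT id FROM bpr "
    "WHERE id NOT IN (SELECT bpr_id FROM bprc) ORDER BY id").fetchall()

# NOT EXISTS is NULL-safe and finds the genuinely unprocessed rows.
not_exists = conn.execute(
    "SELECT id FROM bpr WHERE NOT EXISTS "
    "(SELECT 1 FROM bprc WHERE bprc.bpr_id = bpr.id) "
    "ORDER BY id").fetchall()
```

Here not_in comes back empty while not_exists correctly returns rows 2 and 3.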

Revision history for this message
Francis J. Lacoste (flacoste) wrote :

I'm marking this Work-in-progress as it has received significant edits after the review and it hasn't been merged yet.


Preview Diff

=== modified file 'database/schema/security.cfg'
--- database/schema/security.cfg 2011-08-19 16:03:54 +0000
+++ database/schema/security.cfg 2011-08-23 00:52:55 +0000
@@ -118,6 +118,7 @@
 public.archivesubscriber = SELECT, INSERT, UPDATE
 public.authtoken = SELECT, INSERT, UPDATE, DELETE
 public.binaryandsourcepackagenameview = SELECT
+public.binarypackagefile = SELECT, INSERT
 public.binarypackagepath = SELECT, INSERT, DELETE
 public.binarypackagepublishinghistory = SELECT
 public.binarypackagereleasecontents = SELECT, INSERT, DELETE
@@ -2153,6 +2154,10 @@
 [garbo]
 groups=script,read
 public.answercontact = SELECT, DELETE
+public.binarypackagefile = SELECT
+public.binarypackagepath = SELECT, INSERT
+public.binarypackagerelease = SELECT
+public.binarypackagereleasecontents = SELECT, INSERT
 public.branchjob = SELECT, DELETE
 public.bug = SELECT, UPDATE
 public.bugaffectsperson = SELECT
@@ -2176,6 +2181,7 @@
 public.codeimportresult = SELECT, DELETE
 public.emailaddress = SELECT, UPDATE
 public.hwsubmission = SELECT, UPDATE
+public.libraryfilealias = SELECT
 public.job = SELECT, INSERT, DELETE
 public.mailinglistsubscription = SELECT, DELETE
 public.oauthnonce = SELECT, DELETE

=== modified file 'lib/lp/scripts/garbo.py'
--- lib/lp/scripts/garbo.py 2011-08-17 10:43:26 +0000
+++ lib/lp/scripts/garbo.py 2011-08-23 00:52:55 +0000
@@ -1,4 +1,4 @@
-# Copyright 2009-2010 Canonical Ltd. This software is licensed under the
+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
 # GNU Affero General Public License version 3 (see the file LICENSE).

 """Database garbage collection."""
@@ -26,9 +26,14 @@
 from psycopg2 import IntegrityError
 import pytz
 from storm.expr import (
+    Exists,
     In,
+    Join,
+    Not,
+    Select,
     )
 from storm.locals import (
+    And,
     Max,
     Min,
     SQL,
@@ -46,7 +51,10 @@
     sqlvalues,
     )
 from canonical.launchpad.database.emailaddress import EmailAddress
-from canonical.launchpad.database.librarian import TimeLimitedToken
+from canonical.launchpad.database.librarian import (
+    LibraryFileAlias,
+    TimeLimitedToken,
+    )
 from canonical.launchpad.database.oauth import OAuthNonce
 from canonical.launchpad.database.openidconsumer import OpenIDConsumerNonce
 from canonical.launchpad.interfaces.account import AccountStatus
@@ -80,6 +88,7 @@
 from lp.registry.model.person import Person
 from lp.services.job.model.job import Job
 from lp.services.log.logger import PrefixFilter
+from lp.services.memcache.interfaces import IMemcacheClient
 from lp.services.propertycache import cachedproperty
 from lp.services.scripts.base import (
     LaunchpadCronScript,
@@ -87,6 +96,14 @@
     SilentLaunchpadScriptFailure,
     )
 from lp.services.session.model import SessionData
+from lp.soyuz.interfaces.binarypackagereleasecontents import (
+    IBinaryPackageReleaseContentsSet,
+    )
+from lp.soyuz.model.binarypackagereleasecontents import (
+    BinaryPackageReleaseContents,
+    )
+from lp.soyuz.model.binarypackagerelease import BinaryPackageRelease
+from lp.soyuz.model.files import BinaryPackageFile
 from lp.translations.interfaces.potemplate import IPOTemplateSet
 from lp.translations.model.potranslation import POTranslation
 from lp.translations.model.potmsgset import POTMsgSet
@@ -868,6 +885,57 @@
         self.done = True


+class PopulateBinaryPackageReleaseContents(TunableLoop):
+    maximum_chunk_size = 20
+
+    def __init__(self, log, abort_time=None):
+        super(PopulateBinaryPackageReleaseContents, self).__init__(
+            log, abort_time)
+        value = getUtility(IMemcacheClient).get('populate-bprc')
+        if not value:
+            self.start_at = 0
+        else:
+            self.start_at = value
+        self.done = self.getCandidateBPRs(self.start_at).is_empty()
+
+    def getCandidateBPRs(self, start_at):
+        return IMasterStore(BinaryPackageRelease).using(
+            BinaryPackageRelease,
+            Join(
+                BinaryPackageFile,
+                BinaryPackageFile.binarypackagereleaseID ==
+                    BinaryPackageRelease.id),
+            Join(
+                LibraryFileAlias,
+                And(LibraryFileAlias.id == BinaryPackageFile.libraryfileID,
+                    LibraryFileAlias.content != None
+                ))).find(
+            BinaryPackageRelease, BinaryPackageRelease.id >= start_at,
+            Not(Exists(Select(
+                BinaryPackageReleaseContents.binarypackagepath_id,
+                BinaryPackageReleaseContents.binarypackagerelease_id ==
+                    BinaryPackageRelease.id,
+                tables=[BinaryPackageReleaseContents])))
+            ).order_by(BinaryPackageRelease.id)
+
+    def isDone(self):
+        """See `TunableLoop`."""
+        return self.done
+
+    def __call__(self, chunk_size):
+        """See `TunableLoop`."""
+        bprs = self.getCandidateBPRs(self.start_at)[:chunk_size]
+        self.done = self.getCandidateBPRs(self.start_at).is_empty()
+        for bpr in bprs:
+            result = getUtility(IBinaryPackageReleaseContentsSet).add(bpr)
+            if not result:
+                self.log.warning('BPR %d unable to be added.' % bpr.id)
+            self.start_at = bpr.id + 1
+        result = getUtility(IMemcacheClient).set(
+            'populate-bprc', self.start_at)
+        transaction.commit()
+
+
 class UnusedPOTMsgSetPruner(TunableLoop):
     """Cleans up unused POTMsgSets."""

@@ -1157,6 +1225,7 @@
         UnusedSessionPruner,
         DuplicateSessionPruner,
         BugHeatUpdater,
+        PopulateBinaryPackageReleaseContents,
         ]
     experimental_tunable_loops = []

=== modified file 'lib/lp/scripts/tests/test_garbo.py'
--- lib/lp/scripts/tests/test_garbo.py 2011-08-22 15:08:03 +0000
+++ lib/lp/scripts/tests/test_garbo.py 2011-08-23 00:52:55 +0000
@@ -1,10 +1,9 @@
-# Copyright 2009-2010 Canonical Ltd. This software is licensed under the
+# Copyright 2009-2011 Canonical Ltd. This software is licensed under the
 # GNU Affero General Public License version 3 (see the file LICENSE).

 """Test the database garbage collector."""

 __metaclass__ = type
-__all__ = []

 from datetime import (
     datetime,
@@ -60,6 +59,7 @@
     LaunchpadZopelessLayer,
     ZopelessDatabaseLayer,
     )
+from lp.archiveuploader.tests import datadir
 from lp.answers.model.answercontact import AnswerContact
 from lp.bugs.model.bugmessage import BugMessage
 from lp.bugs.model.bugnotification import (
@@ -98,6 +98,10 @@
     SessionPkgData,
     )
 from lp.services.worlddata.interfaces.language import ILanguageSet
+from lp.soyuz.model.binarypackagereleasecontents import (
+    BinaryPackageReleaseContents,
+    )
+from lp.soyuz.model.files import BinaryPackageFile
 from lp.testing import (
     person_logged_in,
     TestCase,
@@ -122,6 +126,11 @@

     def test_hourly_script(self):
         """Ensure garbo-hourly.py actually runs."""
+        # Our sampledata doesn't contain anythng that
+        # PopulateBinaryPackageReleaseContents can process without errors,
+        # so it's easier to just remove every BinaryPackageFile.
+        IMasterStore(BinaryPackageFile).find(BinaryPackageFile).remove()
+        transaction.commit()
         rv, out, err = run_script(
             "cronscripts/garbo-hourly.py", ["-q"], expect_returncode=0)
         self.failIf(out.strip(), "Output to stdout: %s" % out)
@@ -367,6 +376,13 @@
         self.log.addHandler(NullHandler())
         self.log.propagate = 0

+        self.layer.switchDbUser('testadmin')
+        # Our sampledata doesn't contain anythng that
+        # PopulateBinaryPackageReleaseContents can process without errors,
+        # so it's easier to just remove every BinaryPackageFile.
+        IMasterStore(BinaryPackageFile).find(BinaryPackageFile).remove()
+        transaction.commit()
+
         # Run the garbage collectors to remove any existing garbage,
         # starting us in a known state.
         self.runDaily()
@@ -948,6 +964,33 @@

         self.assertEqual(1, count)

+    def test_populate_bprc(self):
+        LaunchpadZopelessLayer.switchDbUser('testadmin')
+        bpr = self.factory.makeBinaryPackageRelease()
+        deb = open(datadir('pmount_0.9.7-2ubuntu2_amd64.deb'), 'r')
+        lfa = self.factory.makeLibraryFileAlias(
+            filename='pmount_0.9.7-2ubuntu2_amd64.deb', content=deb.read())
+        deb.close()
+        transaction.commit()
+        bpr.addFile(lfa)
+        transaction.commit()
+        self.runHourly()
+        bprc = IMasterStore(BinaryPackageReleaseContents).find(
+            BinaryPackageReleaseContents)
+        self.assertEqual(13, bprc.count())
+        paths = map(lambda x: x.binarypackagepath.path, bprc)
+        expected_paths = [
+            'etc/pmount.allow', 'usr/bin/pumount', 'usr/bin/pmount-hal',
+            'usr/bin/pmount', 'usr/share/doc/pmount/TODO',
+            'usr/share/doc/pmount/README.Debian',
+            'usr/share/doc/pmount/AUTHORS', 'usr/share/doc/pmount/copyright',
+            'usr/share/doc/pmount/changelog.gz',
+            'usr/share/doc/pmount/changelog.Debian.gz',
+            'usr/share/man/man1/pmount-hal.1.gz',
+            'usr/share/man/man1/pmount.1.gz',
+            'usr/share/man/man1/pumount.1.gz']
+        self.assertContentEqual(expected_paths, paths)
+
     def test_UnusedPOTMsgSetPruner_removes_obsolete_message_sets(self):
         # UnusedPOTMsgSetPruner removes any POTMsgSet that are
         # participating in a POTemplate only as obsolete messages.