Code review comment for ~nacc/git-ubuntu:lp1730734-cache-importer-progress

Nish Aravamudan (nacc) wrote :

On 08.01.2018 [20:17:52 -0000], ChristianEhrhardt wrote:
> I'm not sure I can help you with the "implications" as you asked on
> IRC without thinking much more into it. But I have a few questions
> that might help to clarify at least.

Understood and thank you!

> To do so I want to hear more details on the actual use-case that lets
> us run into this race:

The race is internal, in some ways, to the importer, but you do bring up
an interesting second case.

`git ubuntu import` is managed by a looping script,
scripts/import-source-packages.py, which operates in a keep-up
fashion. Roughly:

Obtain a LIST of publishing events in reverse chronological order
Walk LIST backwards until we find a PUBLISH that we have
already imported (or earlier)
   Walk LIST forward after PUBLISH and IMPORT each unique source package
name

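The keep-up walk above can be sketched roughly as follows (the function
name and record shapes here are hypothetical, not the real script's API):

```python
def plan_keepup(events, already_imported):
    """Sketch of the keep-up loop in scripts/import-source-packages.py.

    `events` is LIST: publishing events, newest first.  Walk it
    backwards in time until we hit a PUBLISH we have already imported,
    then walk forward from there, collecting each unique source
    package name to IMPORT.
    """
    start = len(events)  # if nothing was imported yet, take them all
    for i, event in enumerate(events):
        if already_imported(event):
            start = i  # PUBLISH we have already imported (or earlier)
            break

    seen, order = set(), []
    for event in reversed(events[:start]):  # oldest-after-PUBLISH first
        name = event["source"]
        if name not in seen:
            seen.add(name)
            order.append(name)
    return order  # caller runs `git ubuntu import` on each name
```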
IMPORT itself is algorithmically:
   Get current REPOSITORY
   Get Launchpad publishing DATA for a given source package in reverse
chronological order
   Walk DATA backwards until we find a PUBLISH that we have already
imported (or earlier)
      Walk DATA forwards and import each publish record
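IMPORT's walk over DATA can be sketched the same way (again with
hypothetical names; here "already imported" is decided by comparing a
stored timestamp, which is an assumption for illustration):

```python
def plan_import(publishes, last_imported):
    """Sketch of IMPORT for one source package.

    `publishes` is DATA: Launchpad publishing records for the package,
    newest first.  Walk DATA backwards in time until we find a PUBLISH
    we have already imported (or earlier), then return the newer
    records in chronological order so each can be imported in turn.
    """
    start = len(publishes)
    for i, record in enumerate(publishes):
        if record["date"] <= last_imported:
            start = i  # already imported (or earlier)
            break
    return list(reversed(publishes[:start]))  # oldest first
```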

The issue we are trying to resolve here is the "until we find a PUBLISH
that we have already imported" in IMPORT.

In the prior importer code, we had a unique Git commit for every
publication event, because the targeted <series>-<pocket> was part of
the Git commit (via the commit message and parenting). So as we did
IMPORT's reverse walk, we could look at the branch tips and compare
their commit data (where we stored the publishing timestamp) to the
publishing records to find the last imported publishing record.

We dropped that ability by dropping publishing parents altogether. We
now just import each published version once, tying versions together
only by their changelogs. We then forcibly move the branch tips.

So now, if we use an unmodified importer, IMPORT's reverse walk will go
back to the last publish that was the first time we saw a given version
(typically in Debian, therefore) and walk forward from there. That means
unnecessary iteration over publishing data, at least, and unnecessary
moving of the branch pointers.

My branch modifies the catch-up to move the storage of the publication
event to an external cache, currently DBM.
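A minimal sketch of what such an external cache looks like with the
stdlib dbm module; the key/value layout is my assumption for
illustration, not necessarily the branch's actual schema:

```python
import dbm
import os
import tempfile

# Cache file lives outside the git repository (path is illustrative).
path = os.path.join(tempfile.mkdtemp(), "import-cache")

with dbm.open(path, "c") as cache:
    # IMPORT finished walking DATA for nginx: remember the publication
    # timestamp where we stopped (strings are stored as bytes by dbm).
    cache["nginx"] = "2018-01-08T20:17:52"

with dbm.open(path, "r") as cache:
    # The next IMPORT run starts its reverse walk from this timestamp
    # instead of re-scanning the full publishing history.
    last = cache.get(b"nginx")  # → b"2018-01-08T20:17:52"
```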

> 1) Do we even want that two importers run concurrently on the same system?
> If not lets just lock the DB to be of single use - problem solved.

Right, so we have a mode we know will exist at some point, where a
linear script walks all 'to-import' source packages to get them loaded
into Launchpad, while the keep-up script ensures that publishes
happening while the linear walker runs still get integrated. So we do
expect two imports of the same source package to run concurrently (and
that shouldn't just break). What we don't want is for a slower 'old'
import (one whose list of events to import is older) to somehow trump a
faster 'new' import and thus end up setting the branch pointers to the
wrong place(s).

The problem is that if we just outright lock the DB, then the linear
script can't work if it's also using the DB. And the point is that they
are both import operations, using `git-ubuntu-import`, and if we don't
pass the same DB to both, then we will end up in the state I mentioned
above, I think. At that point it's mostly an efficiency issue, where the
linear script will do work that the catch-up script will also do. But
with our redesign, I guess, the latter will be a lot faster, since it
will just see that the versions it needs to iterate are already imported
and can skip them.

Hrm, that last sentence sounds good, but I think it actually needs code
to back it up.

> 2) If we want multiple processes to be allowed we could make the use of
> the cache exclusive per package. (I beg your pardon if I don't see the point why this would
> fail). So only one worker can be in said [package] at any time.
> Processes working on different packages should never conflict right?
> Another worker with a different [package] can continue and we know there won't be a race.
> Essentially locking on [package] (yes this might need a better db to support
> rollback/timeouts on aborted processes if you want to lock in the DB).
> The tail of the processing would be updating sphhr and then unlocking the given package.

The more I think about it, I think I am spinning myself on this problem
for no reason :)

I think we can let the linear walker not use the dbm (or sqlite) cache
[we can keep the code, just not pass anything], which will just mean
that if there is a race between when the linear walker hits a source
package and when the keep-up script does, there will be some no-op
looping (unless new SPRs are found).

> In general I think you should add tests along these commits that e.g.
> does "import-no-cache-used == import into cache == import with cache".

Yeah, that's a good idea. TBH, I'll need to think about how to add these
as 'unit tests', as we would need a cache to do so.

And as integration tests, I think it might be painful, as it will take
a long time to run.
