On 08.01.2018 [20:17:52 -0000], ChristianEhrhardt wrote:
> I'm not sure I can help you with the "implications" as you asked on
> IRC without thinking much more into it. But I have a few questions
> that might help to clarify at least.

Understood, and thank you!

> To do so I want to hear more details on the actual use-case that lets
> us run into this race:

The race is internal, in some ways, to the importer, but you do bring up an interesting second case.

`git ubuntu import` is managed by a looping script, scripts/import-source-packages.py, which operates in a keep-up fashion. Roughly:

 1. Obtain a LIST of publishing events in reverse chronological order.
 2. Walk LIST backwards until we find a PUBLISH that we have already imported (or earlier).
 3. Walk LIST forward after PUBLISH and IMPORT each unique source package name.

IMPORT itself is, algorithmically:

 1. Get the current REPOSITORY.
 2. Get Launchpad publishing DATA for the given source package in reverse chronological order.
 3. Walk DATA backwards until we find a PUBLISH that we have already imported (or earlier).
 4. Walk DATA forwards and import each publish record.

The issue we are trying to resolve here is the "until we find a PUBLISH that we have already imported" step in IMPORT.

In the prior importer code, we had a unique Git commit for every publication event, because the targeted - was part of the Git commit (via the commit message and parenting). So as we did IMPORT's reverse walk, we could look at the branch tips and compare their commit data (where we stored the publishing timestamp) against the publishing records to find the last imported publishing record.

We dropped that ability by dropping publishing parents altogether. We now import each published version only once, tying the versions together only by their changelogs, and then forcibly move the branch tips. So with an unmodified importer, IMPORT's reverse walk will go back to the publish where we first saw a given version (typically in Debian, therefore) and walk forward from there. That means unnecessary iteration over publishing data, at least, and unnecessary moving of the branch pointers.

My branch modifies the catch-up logic to store the publication events in an external cache, currently DBM (there is a rough sketch of what I mean below).

> 1) Do we even want that two importers run concurrently on the same system?
> If not lets just lock the DB to be of single use - problem solved.

Right, so we have a mode we know will exist at some point, where there is a linear script walking all 'to-import' source packages and getting them loaded to Launchpad, and the keep-up script, which ensures that publishes happening while the linear walker is going still get integrated. So we do expect to see two imports of the same source package going on at once, and that shouldn't just break. What we don't want is for a slower 'old' import (where the list of events to import is older) to somehow trump a faster 'new' import and end up setting the branch pointers to the wrong place(s).

The problem is that if we just outright lock the DB, then the linear script can't work if it's also using the DB. And the point is that they are both import operations, using `git-ubuntu-import`, so if we don't pass the same DB to both, we will end up in the state I mentioned above, I think. At that point it's mostly an efficiency question: the linear script is going to do work that the catch-up script will also do.
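To make the cache idea concrete, here is a rough Python sketch of IMPORT's walk against a DBM cache. This is only my mental model, not the actual git-ubuntu code: `get_publishing_data`, `import_publish`, and the `.id` attribute on publish records are made-up stand-ins.

```python
# Rough sketch only: the helper names and record shape are invented.
import dbm

def import_source_package(name, get_publishing_data, import_publish,
                          cache_path='import-cache.dbm'):
    """Import any publishes of `name` newer than the cached one.

    `get_publishing_data` is assumed to yield publish records in
    reverse chronological order, each with an `.id` attribute.
    """
    with dbm.open(cache_path, 'c') as cache:
        key = name.encode()
        last_imported = cache.get(key)  # None on a cold cache

        # Reverse walk: collect records until we hit the publish we
        # already imported (or exhaust the data on a cold cache).
        pending = []
        for record in get_publishing_data(name):
            if (last_imported is not None
                    and record.id == last_imported.decode()):
                break
            pending.append(record)

        # Forward walk: import oldest first, recording progress as we
        # go so a crash does not lose our position. Note that plain
        # dbm gives us no cross-process locking, which is exactly the
        # race we are discussing.
        for record in reversed(pending):
            import_publish(record)
            cache[key] = record.id.encode()
```

With a warm cache the reverse walk stops at the very first record, which is what should make repeat visits to a package cheap.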
But with our redesign, I guess, that duplicated work will be a lot faster, since the second run will just see that the versions it needs to iterate over are already imported and can skip them. Hrm, that last sentence sounds good, but I think it actually needs code to back it up.

> 2) If we want multiple processes to be allowed we could make the use of
> the cache exclusive per package. (I beg your pardon if I don't see the point why this would
> fail). So only one worker can be in said [package] at any time.
> Processes working on different packages should never conflict right?
> Another worker with a different [package] can continue and we know there won't be a race.
> Essentially locking on [package] (yes this might need a better db to support
> rollback/timeouts on aborted processes if you want to lock in the DB).
> The tail of the processing would be updating sphhr and then unlocking the given package.

The more I think about it, the more I think I am spinning myself up over this problem for no reason :) I think we can let the linear walker not use the dbm [we can keep the code, just not pass anything] (or sqlite), and it will just mean that if there is a race between when the linear walker hits a source package and when the keep-up script does, there will be some no-op looping (unless new SPRs are found).

> In general I think you should add tests along these commits that e.g.
> does "import-no-cache-used == import into cache == import with cache".

Yeah, that's a good idea. TBH, I'll need to think about how to add these as 'unit tests', since we would need a cache to do so. And as integration tests I think it might be painful, as they would take a long time to run. A sketch of what I think such a test would assert is below.
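For the unit-test flavour, something like this pytest sketch could exercise the cache logic against fake publishing records, reusing the hypothetical `import_source_package()` from the sketch above (assume it is in scope); the real test would of course have to drive the importer itself:

```python
# Sketch only: Record and the fake history stand in for real
# Launchpad publishing data; 'hello' is just an example package.
from collections import namedtuple

Record = namedtuple('Record', ['id'])

def test_cache_skips_already_imported(tmp_path):
    history = [Record('3'), Record('2'), Record('1')]  # newest first
    imported = []

    def get_data(name):
        return list(history)

    cache = str(tmp_path / 'cache.dbm')

    # Cold cache: everything is imported, oldest first.
    import_source_package('hello', get_data, imported.append,
                          cache_path=cache)
    assert [r.id for r in imported] == ['1', '2', '3']

    # Warm cache: the re-run is a no-op, nothing new is imported.
    import_source_package('hello', get_data, imported.append,
                          cache_path=cache)
    assert [r.id for r in imported] == ['1', '2', '3']

    # A new publish is picked up without redoing the old ones.
    history.insert(0, Record('4'))
    import_source_package('hello', get_data, imported.append,
                          cache_path=cache)
    assert [r.id for r in imported] == ['1', '2', '3', '4']
```

The "import-no-cache-used == import into cache == import with cache" equivalence would then be a third assertion comparing the resulting branch tips of a cacheless run against a cached one, which I think only an integration test can really do.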