bzr: ERROR: bzrlib.errors.RetryWithNewPacks: Unprintable exception RetryWithNewPacks

Bug #709349 reported by Jean-Paul Calderone
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Bazaar
In Progress
High
Unassigned

Bug Description

In the same setup described on #701940, but with bzr upgraded to 2.2.3, an update failed with a traceback like this:

Traceback (most recent call last):
  File "/usr/lib/python2.6/dist-packages/bzrlib/commands.py", line 912, in exception_to_return_code
    return the_callable(*args, **kwargs)
  File "/usr/lib/python2.6/dist-packages/bzrlib/commands.py", line 1112, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/python2.6/dist-packages/bzrlib/commands.py", line 690, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/usr/lib/python2.6/dist-packages/bzrlib/commands.py", line 705, in run
    return self._operation.run_simple(*args, **kwargs)
  File "/usr/lib/python2.6/dist-packages/bzrlib/cleanup.py", line 135, in run_simple
    self.cleanups, self.func, *args, **kwargs)
  File "/usr/lib/python2.6/dist-packages/bzrlib/cleanup.py", line 165, in _do_with_cleanups
    result = func(*args, **kwargs)
  File "/usr/lib/python2.6/dist-packages/bzrlib/builtins.py", line 1342, in run
    accelerator_tree, hardlink)
  File "/usr/lib/python2.6/dist-packages/bzrlib/plugins/svn/branch.py", line 376, in create_checkout
    hardlink=hardlink)
  File "/usr/lib/python2.6/dist-packages/bzrlib/plugins/svn/branch.py", line 321, in _create_heavyweight_checkout
    checkout_branch.pull(self, stop_revision=revision_id)
  File "/usr/lib/python2.6/dist-packages/bzrlib/branch.py", line 1068, in pull
    possible_transports=possible_transports, *args, **kwargs)
  File "/usr/lib/python2.6/dist-packages/bzrlib/plugins/svn/branch.py", line 817, in pull
    stop_revision, overwrite, limit=limit)
  File "/usr/lib/python2.6/dist-packages/bzrlib/plugins/svn/branch.py", line 749, in _update_revisions
    project=self.source.project, mapping=self.source.mapping)
  File "/usr/lib/python2.6/dist-packages/bzrlib/plugins/svn/fetch.py", line 1528, in fetch
    self.target.pack(hint=pack_hint)
  File "/usr/lib/python2.6/dist-packages/bzrlib/decorators.py", line 194, in write_locked
    result = unbound(self, *args, **kwargs)
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/pack_repo.py", line 2419, in pack
    self._pack_collection.pack(hint=hint, clean_obsolete_packs=clean_obsolete_packs)
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/pack_repo.py", line 1579, in pack
    self._try_pack_operations(hint)
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/pack_repo.py", line 1600, in _try_pack_operations
    reload_func=self._restart_pack_operations)
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/groupcompress_repo.py", line 690, in _execute_pack_operations
    result = packer.pack()
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/pack_repo.py", line 746, in pack
    return self._create_pack_from_packs()
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/groupcompress_repo.py", line 472, in _create_pack_from_packs
    self._copy_revision_texts()
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/groupcompress_repo.py", line 410, in _copy_revision_texts
    'revisions', self._get_progress_stream, 1)
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/groupcompress_repo.py", line 399, in _copy_stream
    reuse_blocks=False):
  File "/usr/lib/python2.6/dist-packages/bzrlib/groupcompress.py", line 1653, in _insert_record_stream
    for record in stream:
  File "/usr/lib/python2.6/dist-packages/bzrlib/repofmt/groupcompress_repo.py", line 207, in pb_stream
    for idx, record in enumerate(substream):
  File "/usr/lib/python2.6/dist-packages/bzrlib/groupcompress.py", line 1418, in get_record_stream
    self._access.reload_or_raise(e)
  File "/usr/lib/python2.6/dist-packages/bzrlib/knit.py", line 3408, in reload_or_raise
    elif not self._reload_func():
  File "/usr/lib/python2.6/dist-packages/bzrlib/groupcompress.py", line 1413, in get_record_stream
    orig_keys, ordering, include_delta_closure):
  File "/usr/lib/python2.6/dist-packages/bzrlib/groupcompress.py", line 1572, in _get_remaining_record_stream
    for factory in batcher.yield_factories(full_flush=True):
  File "/usr/lib/python2.6/dist-packages/bzrlib/groupcompress.py", line 1130, in yield_factories
    block_read_memo, block = blocks.next()
  File "/usr/lib/python2.6/dist-packages/bzrlib/groupcompress.py", line 1369, in _get_blocks
    zdata = raw_records.next()
  File "/usr/lib/python2.6/dist-packages/bzrlib/knit.py", line 3381, in get_raw_records
    exc_info=sys.exc_info())
RetryWithNewPacks: Unprintable exception RetryWithNewPacks: ....

The full exception (very long) is in the attached crash report. Note that like #701940, this is part of an automated build system. Something I realized since I filed the other ticket is that there are two builders using the shared repository this exception references, and they almost certainly both issued a pull of the same new revisions at the same time.

Revision history for this message
Jean-Paul Calderone (exarkun) wrote :
Revision history for this message
Jean-Paul Calderone (exarkun) wrote :

Also, the state of the shared repository after this failure is like it was in #701940. There are files in obsolete_packs which I have to move back to packs and indices before the repository is usable again.

Revision history for this message
Andrew Bennetts (spiv) wrote :

This suggests that the fix for bug 701940 isn't complete :(

I guess I'll try making some more complex and/or pessimistic tests and see if I can reproduce.

Changed in bzr:
assignee: nobody → Andrew Bennetts (spiv)
importance: Undecided → High
status: New → Confirmed
Jelmer Vernooij (jelmer)
tags: added: packs
Andrew Bennetts (spiv)
Changed in bzr:
status: Confirmed → In Progress
Revision history for this message
Andrew Bennetts (spiv) wrote :

Ok, I've got a trivial fix for the “Unprintable exception”, and with some hackery to inject a NoSuchFile error at the appropriate time I have a test that reproduces this traceback. Turning that into a tasteful test might be tricky, but it's a good start.

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 709349] Re: bzr: ERROR: bzrlib.errors.RetryWithNewPacks: Unprintable exception RetryWithNewPacks

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 3/14/2011 8:12 AM, Andrew Bennetts wrote:
> Ok, I've got a trivial fix for the “Unprintable exception”, and with
> some hackery to inject a NoSuchFile error at the appropriate time I have
> a test that reproduces this traceback. Turning that into a tasteful
> test might be tricky, but it's a good start.
>

RetryWithNewPacks isn't supposed to ever bubble up to the user. If it
is, then we have another bug. Because you are only supposed to have set
the _reload_func when you are prepared to catch the exception.

Anyway, I'm happy to have a nicer traceback for bugs, but I did want to
point out there seems to be a logic issue as well.

John
=:->
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk199sYACgkQJdeBCYSNAAML+wCgo40q8FsYgVtLRTK6Xu9jZ1+O
+5UAoNDXdL8tqZTZhJ+v4FnoKZTMHDk4
=mmXu
-----END PGP SIGNATURE-----

Revision history for this message
Andrew Bennetts (spiv) wrote :

John Arbash Meinel wrote:
> On 3/14/2011 8:12 AM, Andrew Bennetts wrote:
> > Ok, I've got a trivial fix for the “Unprintable exception”, and with
> > some hackery to inject a NoSuchFile error at the appropriate time I have
> > a test that reproduces this traceback. Turning that into a tasteful
> > test might be tricky, but it's a good start.
> >
>
> RetryWithNewPacks isn't supposed to ever bubble up to the user. If it
> is, then we have another bug. Because you are only supposed to have set
> the _reload_func when you are prepared to catch the exception.
>
> Anyway, I'm happy to have a nicer traceback for bugs, but I did want to
> point out there seems to be a logic issue as well.

Yes, I agree, and that's what I meant. Sorry, I can see how that was
unclear :)

I've reproduced the error, and preventing that error from being raised
is the main issue. But I wanted to mention I've also fixed the trivial
bug causing Unprintable exception.

Revision history for this message
Andrew Bennetts (spiv) wrote :

So, I'm stumped. Here's the situation as I understand it:

 * While performing a pack (explicitly triggered by bzr-svn), a read from a pack file named in pack-names failed with NoSuchFile
 * NoSuchFile there is a normal occurrence if another process concurrently packed or autopacked in that repository.
 * The code (_DirectPackAccess.get_raw_records) catches NoSuchFile, and raises RetryWithNewPacks to signal to the caller that it needs to retry
 * The caller, GroupCompressVersionedFiles.get_record_stream, catches RetryWithNewPacks, and calls reload_or_raise
 * The reload_or_raise calls the reload_func that was passing in from higher up the stack, _restart_pack_operations
 * _restart_pack_operations first calls reload_pack_names

Now, here's what is *supposed* to happen next:

 * reload_pack_names returns True (i.e. yes, the pack names changed)
 * _restart_pack_operations raises RetryPackOperations to signal that the whole pack operation (_try_pack_operations) driven from RepositoryPackCollection.pack needs to be restarted (because we were planning to rewrite a pack that is now gone we need to restart more than just the individual read of some records from the pack)
 * RepositoryPackCollection.pack catches RetryPackOperations, calls _try_pack_operations again, and _try_pack_operations is robust in the face of the 'hint' param referring to packs that no longer exist (as is likely)

Here's what I infer has happened next:

 * reload_pack_names returned False: because ensure_loaded was definitely already called this means that the pack-names had not changed
 * so we now have the situation that a pack named in the 'hint' param, and that must have existed in pack-names for _try_pack_operations to have included it in the list of packs to repack, does not exist on disk
 * this is a logically impossible situation which the code intentionally reacts to by reraising the RetryWithNewPacks error

Barring OS/filesystem weirdness (like removal of the .pack file being visible to this process before the new pack-names content is visible, despite that being very much not the order these operations occur in), I don't see how this can happen. The crash file says the system involved is Lucid, what's the filesystem involved?

Perhaps we need to be more paranoid and doing explicit fsync or similar on pack-names rather than just relying on atomic rename on POSIX? In that case it's a bit surprising that this buildslave is the only report of this problem, though.

Some incidental observations:

 * it's too hard to see what has happened by looking at a traceback when re-raising has happened. We probably should at least mutter something to the log file whenever we do a re-raise to make it clear what code paths are involved, but...
 * the BzrLogTail in the crash file is too short to show us any mutters in this case anyway. It'd be good to always include the entire log contents for this process, I think, or at least increase the BzrLogTail size to be longer than the largest typical traceback. Perhaps just doubling the current size would suffice.

FWIW, the branch I have so far is: <lp:~spiv/bzr/retry-with-new-packs-709349-2.3>, but it's not particularly interesting.

Revision history for this message
Andrew Bennetts (spiv) wrote :

I've spoken with Jean-Paul, and the filesystem involved was either ext3 or ext4 (he's not certain exactly which of Twisted's buildslaves this crash came from). Either way, I don't think the filesystem is the culprit. If there were a network filesystem involved then coherency problems like this might be possible. It's hard to imagine how repository writes by one process would cause Linux to present a second, concurrent process with only some of the later filesystem updates. (ext3/4 can have somewhat lax guarantees about what is written to disk in what order, but that's mainly an issue if the system crashes or loses power, which is not the scenario here.)

So I'm left with no good hypotheses, except the obvious “there's something I'm missing” :)

Jean-Paul is going to reenable the relevant shared repository sometime after PyCon (the buildslaves were switched to standalone repos to workaround this issue) so we can gather more data. Apparently they were seeing this error roughly once per day.

Revision history for this message
Andrew Bennetts (spiv) wrote :

Technically this is Incomplete as we're stalled on needing more info, so marking as such to keep my todo list tidy, even though it's also In Progress in a sense.

Changed in bzr:
status: In Progress → Incomplete
Revision history for this message
Robert Collins (lifeless) wrote :

Two things...

incomplete bugs expire now, so unless you are happy with the bug
disappearing, I would not use 'incomplete'.

Secondly, does twisted have windows build slaves? We've had
poor-coherency issues with windows before.

Revision history for this message
Martin Pool (mbp) wrote :

On 23 March 2011 17:38, Robert Collins <email address hidden> wrote:
> Two things...
>
> incomplete bugs expire now, so unless you are happy with the bug
> disappearing, I would not use 'incomplete'.

Hooray for expiry.

I think Incomplete is appropriate iff the bug cannot progress without
more information; this bug is probably in that state.

Revision history for this message
Jean-Paul Calderone (exarkun) wrote :

Here is a program which invokes bzr concurrently and reproduces the obsolete_packs problem.

Revision history for this message
John A Meinel (jameinel) wrote :

Here is what I get on Windows:
Updating FilePath('c:\\users\\jameinel\\appdata\\local\\temp\\tmpvcldzh\\tkrzbu')
['C:\\Python26\\python.exe', '/Users/jameinel/dev/bzr/bzr.dev/bzr', 'pull', '-d', 'c:\\users\\jamein
el\\appdata\\local\\temp\\tmpvcldzh\\2hnyc5'] => Using saved parent location: C:/users/jameinel/appd
ata/local/temp/tmpk_ca5g/
bzr: ERROR (ignored): 'file:///C:/users/jameinel/appdata/local/temp/tmpvcldzh/.bzr/repository/upload
/ixn21ph3jwgmlld785iy.pack'
bzr: ERROR: File exists: u'C:/users/jameinel/appdata/local/temp/tmpvcldzh/.bzr/repository/upload/ixn21ph3jwgmlld785iy.pack': [Error 183] Cannot create a file when that file already exists

It would appear that we are getting a name collision. So one update script is creating a pack with exactly the same name as the other. One of them is thinking that this pack should be obsolete, and the other is thinking the pack file should be added.

Because of the exact content match, you end up with confusion on the 2-way diff of pack-names.
 1) Process 1 creates abcd.pack
 2) Process 2 fills upload/ with content that would create abcd.pack
 3) Process 1 starts repacking, seeing that abcd.pack is now obsolete (the content will be combined into a different pack)
 4) Process 1 updates pack-names indicating that it will now remove abcd.pack, at this point, it is considered safe to move abcd.pack to obsolete_packs because it is no longer referenced.
 5) Process 2 reads pack-names and sees that abcd.pack does not exist, so adds it and releases the write lock
 6) Process 2 renames the new pack from upload to packs/abcd.pack
 7) Process 1 renames abcd.pack to obsolete_packs

If this is true, something is able to generate pack files that are colliding. This shouldn't really be possible, since if we have the content, we shouldn't think we need to pull it in again.

I guess if you are doing concurrent pulling from the same branch to the same repository, since the first update hasn't finished yet, it doesn't see the content is already present. And the second fetch starts before the first one finishes, doing exactly the same work.

I don't have an immediate solution, but I do think this could be a handle on the problem.

Revision history for this message
Martin Pool (mbp) wrote :

On 24 March 2011 06:48, John A Meinel <email address hidden> wrote:
> If this is true, something is able to generate pack files that are
> colliding. This shouldn't really be possible, since if we have the
> content, we shouldn't think we need to pull it in again.
>
> I guess if you are doing concurrent pulling from the same branch to the
> same repository, since the first update hasn't finished yet, it doesn't
> see the content is already present. And the second fetch starts before
> the first one finishes, doing exactly the same work.

That sounds entirely possible. It seems like we need to do something
like just checking whether a file identical to the one we want to
generate is now present, rather than restarting as such. I thought we
already did that.

Revision history for this message
Andrew Bennetts (spiv) wrote :

Thanks very much for that script! Let's put this back on my queue.

Changed in bzr:
status: Incomplete → In Progress
Revision history for this message
Robert Collins (lifeless) wrote :

"""If this is true, something is able to generate pack files that are colliding. This shouldn't really be possible, since if we have the content, we shouldn't think we need to pull it in again.

I guess if you are doing concurrent pulling from the same branch to the same repository, since the first update hasn't finished yet, it doesn't see the content is already present. And the second fetch starts before the first one finishes, doing exactly the same work.

I don't have an immediate solution, but I do think this could be a handle on the problem.
"""

branching the same upstream revision into a shared repository from two branches will generate colliding packs.

I think the race is well analyzed. Possible fixes that come to mind:
 - move to obsolete during the critical section
 - read-after write [check the pack isn't still present[seems still racy]]

Revision history for this message
John A Meinel (jameinel) wrote :

I wonder if this is related to bug #165293 at all.

Revision history for this message
Andrew Bennetts (spiv) wrote :

A quick update on where I am:

This code is pretty racy. There's definitely the race that John describes, but I suspect there are others. The breakbzr.py script fails in a variety of different ways, depending (presumably) on timing:

 * NoSuchRevision during _walk_to_common_revisions.
 * KeyError: 0, during iter_entries in btree_index.py
 * ShortReadvErrors reading cix files that are half the expected size, also during iter_entries
 * FileExists as John reports, if using a filesystem/OS that doesn't allow renaming over existing files.
 * NoSuchFile when one process obsoletes a pack another process is concurrently adding

Doing the move to obsolete during the critical section doesn't appear to be sufficient.

I'm working making the state involved more explicit and thus more testable, so we can comprehensively test the many possible races.

Andrew Bennetts (spiv)
Changed in bzr:
assignee: Andrew Bennetts (spiv) → nobody
Revision history for this message
Martin Packman (gz) wrote :

I've proposed just the exception formatting fix by Andrew for merging:

<https://code.launchpad.net/~gz/bzr/2.3_unprintable_retrywithnewpacks/+merge/81827>

Martin Packman (gz)
tags: added: affects-twisted
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.