Push efficiency: MissingObjectFinder yields too many unnecessary objects

Bug #562676 reported by Jelmer Vernooij
46
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Dulwich
Fix Released
High
Artem Tikhomirov

Bug Description

Reported by abderrahim on github:

That can go from as little as all the files in the pushed revisions (not just the changed files) to the whole branch back to its root (in case a branch is made off an old revision).
This script should be equivalent to git rev-list --objects (except that it doesn't resolve refs or short hashes), but it gives much more, for example:
for a single revision :
    $ git rev-list --objects 088be0574f98b5eca85f94e38f7647ef465876a9 ^a5d43f6cc88834d881e5ed763cb08d14b4afd7d0 |wc -l
    5
    $ python ../dul-rev-list 088be0574f98b5eca85f94e38f7647ef465876a9 ^a5d43f6cc88834d881e5ed763cb08d14b4afd7d0 |wc -l
    1762
for a branch off an old revision
    $ python ../dul-rev-list 051a598458657f2a2fd28ab791d053952274db82 ^a5d43f6cc88834d881e5ed763cb08d14b4afd7d0 |wc -l
    48723
    $ git rev-list --objects 051a598458657f2a2fd28ab791d053952274db82 ^a5d43f6cc88834d881e5ed763cb08d14b4afd7d0 |wc -l
    424

Here is the script : http://paste.lisp.org/+21VT

Tags: performance
Jelmer Vernooij (jelmer)
description: updated
Revision history for this message
Eric Drechsel (ericdrex) wrote :

jesus.

I mean, uh, here is a related issue downstream: http://github.com/schacon/hg-git/issues/#issue/32

Jelmer Vernooij (jelmer)
Changed in dulwich:
status: New → Triaged
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: none → 0.6.1
Jelmer Vernooij (jelmer)
Changed in dulwich:
importance: Undecided → High
Jelmer Vernooij (jelmer)
tags: added: performance
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: 0.6.1 → 0.7.1
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: 0.7.1 → none
Jelmer Vernooij (jelmer)
Changed in dulwich:
status: Triaged → In Progress
assignee: nobody → Jelmer Vernooij (jelmer)
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

The main problem is that dulwich doesn't look at the ancestors of the "have" revisions to determine what not to send.

Revision history for this message
Artem Tikhomirov (artem-tikhomirov) wrote :

Patch for MissingObjectFinder is attached. The fix doesn't include a test. The functionality has been checked using `git rev-list --objects` approach (as outlined in the bug description) for the same repository and diff of the output. hg-git with updated dulwich works fine. Existing dulwich tests pass without error.

The idea of the fix is straightforward: let H is complete ancestry of what remote already 'haves' , W is complete ancestry of what server 'wants'. difference (W,H) aka missing_commits in the patch, is what we need to walk and send (objects_to_send). intersection(W, H) , aka common_commits, provides base tree (tree elements both local and remote have in common). These base tree elements are recorded as 'already processed' (in sha_done), so that only changed blobs and trees will be considered when iterating over missing_commits.

Thus, only intialization of MissingObjectFinder has been altered, the rest of the logic is left untouched. Set of all commits to send, commits not to send and tree items not to send are built beforehand. Tree elements, blobs and tags to send are build using the same logic as before.

Revision history for this message
Artem Tikhomirov (artem-tikhomirov) wrote :

Few tests for the MissingObjectFinder patch above. Basic case with linear commits and more complicated with fork and merge in the history are covered. Besides, tests ensure erroneous SHAs in haves and wants are tolerated.

Revision history for this message
Artem Tikhomirov (artem-tikhomirov) wrote :

Updated patch for the fix. Now fails with KeyError if unknown SHA1s are in 'haves' or 'wants'.
One-lines parse_*() methods inlined.

Jelmer Vernooij (jelmer)
Changed in dulwich:
assignee: Jelmer Vernooij (jelmer) → nobody
Revision history for this message
Jelmer Vernooij (jelmer) wrote :
Download full text (17.5 KiB)

I'm getting test failures with the latest version of the patch:

PYTHONPATH=.: python -m unittest dulwich.tests.test_suite
.....................................................................ss.s.s.....................s.s..................................s..................................................................................................................................................................................................................................................s......................s.........................................................................................................................................................................................................................................s....E.E.......E.E.......E.E.......E.E.......................Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 86, in run
    self.finish_response()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 127, in finish_response
    self.write(data)
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 210, in write
    self.send_headers()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 268, in send_headers
    self.send_preamble()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 195, in send_preamble
    self._write('Server: %s\r\n' % self.server_software)
  File "/usr/lib/python2.7/socket.py", line 324, in write
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 59343)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 651, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 704, in finish
    self.wfile.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe
----------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 86, in run
    self.finish_response()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 127, in finish_response
    self.write(data)
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 210, in write
    self.send_headers()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 268, in send_headers
    self.send_preamble()
  File "/usr/lib/python2.7/wsgiref/handlers.py", line 195, in send_preamble
    self._write('Server: %s\r\n' % self.server_software)
  File "/usr/lib/python2.7/socket.py", line 324, in write
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self...

Revision history for this message
Artem Tikhomirov (artem-tikhomirov) wrote :

Update #2. SHA1s reported as 'haves' are not necessarily present locally, hence MissingObjectFinder shall not fail with KeyError when encounter them.

All tests pass, although "error: [Errno 32] Broken pipe" messages persist.

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

Unfortunately this reintroduces the previous issues and introduces some more: :-(

  File "dulwich/repo.py", line 893, in fetch_objects
    haves = self.object_store.find_common_revisions(graph_walker)
  File "dulwich/object_store.py", line 185, in find_common_revisions
    sha = graphwalker.next()
  File "dulwich/server.py", line 420, in next
    return self._impl.next()
AttributeError: 'NoneType' object has no attribute 'next'

Revision history for this message
Artem Tikhomirov (artem-tikhomirov) wrote :

 Could you please elaborate which particular repository you test against? I didn't see anything like that with the test suit the last time I ran it.
 Besides, it's not clear what 'previous issues' the fix reintroduces? There was KeyError when there were unrecognized revisions among values in 'haves'. There's no way KeyError is thrown with the last version of the patch. Did you mean any other 'previous issue'?
 Perhaps, if you'd add a failing test to test_missing_obj_finder.py with tailored repository it would be much easier to reproduce and tackle the issue?

Revision history for this message
Artem Tikhomirov (artem-tikhomirov) wrote :

BTW, the stacktrace has nothing to to with the patch, imo. The patch doesn't touch object_store.find_common_revisions(). Changes intoduced to object_store.find_missing_objects get their chance to spoil anyting *after* find_common_revisions() is over (as one can tell from repo.fetch_objects()).

Revision history for this message
Jelmer Vernooij (jelmer) wrote :

I'm not sure what was going on there, I suspect I might have been testing against an older version of the patch.

I've now merged this, with some cosmetic issues (PEP8) fixed up - I didn't really want to go through another iteration.

Thanks for the patch, and your patience. Sorry it took so long to get this landed.

Changed in dulwich:
status: In Progress → Fix Committed
assignee: nobody → Artem Tikhomirov (artem-tikhomirov)
Jelmer Vernooij (jelmer)
Changed in dulwich:
milestone: none → 0.9.0
status: Fix Committed → Fix Released
Revision history for this message
Jelmer Vernooij (jelmer) wrote :

I'm going to revert this change, as we're in the process of relicensing Dulwich to GPLv2+ or Apachev2+. I've tried to contact Artem and Syntevo multiple times but have not heard back from them that they are happy with relicensing.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.