Bazaar

Code review comment for lp:~ian-clatworthy/bzr/faster-dirstate-saving

faster-dirstate-saving
Merge into trunk-old

Revision history for this message

Ian Clatworthy (ian-clatworthy) wrote on 2009-05-28:

Many commonly used commands like status, add and commit update the dirstate, triggering a dirstate serialisation & save. On huge trees like OpenOffice, this is slower than it needs to be. In particular, 'xxx status file' takes 0.4 seconds in hg and 1.0 seconds in bzr and a good percentage of the difference is due to the time we take to serialise the new dirstate.

This branch is an experiment/RFC in fixing that. It drops the time for 'bzr status file' by 30-35% down to 0.65-0.70 seconds. It does that by remembering the serialised form of entries and only re-serialising entries that are known to be changed. Right now, this smart remembering of what's changed is only effectively implemented for status, though the internal API is in place for extending that to other use cases.

Of course, there are other ways of skinning this cat. One option is to write a pyrex serialiser. That ought to be fast but it still doesn't solve the root problem: serialisation time is O(size-of-tree) currently because we only keep a modified vs unmodified flag at the whole-of-dirstate level. Another option is to append 'overlays' to the dirstate file, i.e. entries which have been added or changed vs the base entries. Deletes or renames would trigger a full clean write but the common cases of add and/or change would just append entries. That's non-trivial but potentially very fast.

More broadly, I think the important thing to be begin recording the changes as this patch allows. So my current thoughts are that we ought to start with this patch, make the changes to enable smarter recording for add and commit, and built from there. At any point, we can separately do a pyrex serialiser and it will complement this work.

Having said all that, dirstate is my least favourite part of the Bazaar code base: indexing into tuples using magic integers may be fast but it sucks from an understandability perspective (vs objects + attributes). There are people far more qualified than I to say how this ought to proceed and to write the code, but they're rather busy tackling other things. Regardless, it's been a good exercise for me in getting dirstate paged into my head for other work I'm doing. It's a step forward but I can't definitively say it's in the right direction.

Thoughts?

« Back to merge proposal