Code review comment for lp:~bzr/bzr/faster-commit-file

Revision history for this message
Ian Clatworthy (ian-clatworthy) wrote :

Follow-up IRC discussion re review and next steps ...

(14:46:07) igc: lifeless: so, I might be stupid but I still don't see the problem ...
(14:46:13) lifeless: ok
(14:46:24) igc: iter_changes only needs to return changed parents, not all of them
(14:46:46) lifeless: right, but unless its looking at parents it won't return them, because iter_changes has a bug
(14:46:52) igc: i.e. if a parent isn't in iter_changes, then it isn't needed in apply_delta IIUIC
(14:47:22) igc: lifeless: but it *is* looking at parent right now
(14:47:27) lifeless: how?
(14:47:33) igc: it looks at the whole tree, then post-filters
(14:47:59) lifeless: ow
(14:48:17) lifeless: I'm really torn
(14:48:17) igc: and osutils.parent_directories doesn't need to return the root
(14:48:31) igc: at the os level, '' doesn't make sense
(14:48:52) lifeless: well, other code would be cleaner if it does - as you are making sure you have [''] always
(14:49:05) igc: also, my change to when record_iter_changes was deliberate ...
(14:49:15) igc: and the comment still hold exactly as is btw
(14:49:26) igc: I benchmarked to decide when it was the right thing fwiw
(14:51:24) lifeless: why the (len(self.parents) < 2 and not self.specific_files and not self.exclude)
(14:51:27) lifeless: clause?
(14:51:46) igc: because for non-chk formats, ...
(14:52:21) igc: it's slower than the current code otherwise
(14:52:29) igc: at least on Emacs
(14:53:17) igc: I'm happy to extend the comment along those lines
(14:53:42) lifeless: So this means that the self.specific_files and self.exclude processing with record_iter_changes is _slower_ than the old full-tree code path
(14:54:02) lifeless: and that its only faster with record_iter_changes on chk repositoreis because the chk code is able to compensate
(14:54:13) lifeless: I appreciate that you've got it going faster
(14:54:17) igc: yes, according to my measurements
(14:54:47) lifeless: I'm really worried that the other changes, such as the reinstance of files_across_trees are going to make it harder for someone to come bacj and fix bug 347649 properly
(14:54:48) igc: it was consistently slower - 08. vs 04 or something like that
(14:54:48) ubottu: Launchpad bug 347649 in bzr "iter_changes missing support needed for commit" [High,Confirmed] https://launchpad.net/bugs/347649
(14:54:56) lifeless: as they will have to basically undo this patch to fix it
(14:55:28) lifeless: on the other hand I don't want to block performance improvements
(14:55:35) igc: lifeless: I did look at trying to remove the file_across_trees bit but ...
(14:55:44) igc: it broke the test suite cos ...
(14:55:45) lifeless: do you have any suggestion about how we can do both things?
(14:55:56) igc: we trap for bogus include paths
(14:56:06) igc: and iter_changes doesn't see those
(14:56:21) lifeless: file_across_trees is a problem because it upcassts the dirstate to an inventory - its spectcularly slow
(14:56:44) lifeless: emacs is what 20K files?
(14:56:59) igc: just checking, we're talking about the _set_specific_file_ids() method yes?
(14:57:06) lifeless: yes
(14:57:25) igc: emacs is 8K from memory
(14:57:31) igc: 100K revisions but ...
(14:57:38) igc: that doesn't matter myuch here
(14:57:42) lifeless: ok, so 16K inventory entries
(14:58:19) lifeless: I don't feel good about this code
(14:58:57) lifeless: so I'm going to recuse my self as a review at this point; I don't want to block you, and I haven't managed to open your eyes to why the layering is a problem or matters so much
(14:59:25) lifeless: I'll be effectively offline for 2 weeks, and I don't like the idea of you being blocked for that long.
(14:59:34) igc: I see and agree with the desire for one iter_changes call covering ...
(14:59:50) igc: legal filename checking, strict checking and collecting results
(15:00:49) igc: but I don't see it being necessary to get there in a single step
(15:01:20) lifeless: I bet you cold cache specific file commits will be way slower with your patch
(15:01:39) lifeless: particularly on the large trees I've been working up to - 50K and 100K files
(15:02:00) igc: slower than now or slower than possible?
(15:02:08) lifeless: slower than now
(15:02:25) lifeless: I could be wrong
(15:03:57) lifeless: but what you trade off is some unoptimised code (CHKRepository.add_inventory) for some optimised code that does a lot of IO (WT.iter_changes of everything)
(15:04:49) igc: I'll throw together the OOo tree and benchmark on it
(15:04:52) mtaylor [n=mtaylor@74-61-48-25.sea.clearwire-dns.net] entered the room.
(15:05:04) lifeless: cold cache will be key to see the disk IO impact
(15:05:18) lifeless: I can suggest a completely different approach if you just want to make it faster in the interim
(15:05:22) igc: lifeless: design wise, I have a question ...
(15:05:39) igc: if we delegate specific_file prcoessing to iter_changes ...
(15:05:40) lifeless: (an approach I wouldn't feel unhappy about)
(15:06:02) igc: how we we accurately insert the parents after the fact?
(15:06:07) igc: s/we/can/
(15:06:16) igc: we don't have all the info?
(15:06:21) lifeless: igc: iter_changes is allowed to jump around
(15:06:37) lifeless: igc: it already does this for renames
(15:07:05) lifeless: iter_changes has built into it the logic of find_ids_across_trees
(15:07:13) lifeless: anyway
(15:07:23) lifeless: if you want something easy
(15:07:34) lifeless: add a parameter to finish_inventory called 'basis_inventory'
(15:07:45) lifeless: and if use_record_iter_changes is False in commit
(15:07:55) lifeless: pass self.basis_inventory to finish_inventory
(15:08:05) lifeless: and in finish_inventory do:
(15:08:44) lifeless: if self.repository._format._commit_inv_deltas
(15:08:57) lifeless: delta = self.new_inventory._make_delta(basis_inventory)
(15:09:10) lifeless: self.repository.add_inventory_by_delta(basis_inventory, basis_inventory.revision_id)
(15:09:14) lifeless: ^
(15:09:45) lifeless: this will work around finish_inventory being overly slow in CHK repositories when not using record_iter_changes
(15:10:17) lifeless: and as you'll already have a basis_inventory it should be nigh free
(15:11:23) igc: lifeless: ok, I'll play with that next week and try it on some larger data sets
(15:12:32) lifeless: igc: I would expect that to be ~= to what you have today, and deal with cold cache and other situations much better
(15:12:42) lifeless: no where near as fast as fixing iter_changes
(15:12:57) lifeless: I would expect a fixed iter_changes to smoke everything
(15:14:06) lifeless: as for what to do in iter_changes to handle parent dirs
(15:14:19) lifeless: I would keep a minimal set of directory items output
(15:14:31) lifeless: and a 'we need a parent of X' queue
(15:14:47) lifeless: at the end of the normal loop, if there 'we need a parent of X' is non-empty
(15:15:13) lifeless: we do the normal loop on just the missing parents
(15:15:17) lifeless: without recursion

« Back to merge proposal