Merge lp:~vila/bzr/gdfo-heads into lp:~bzr/bzr/trunk-old
Status: Merged
Merged at revision: not available
Proposed branch: lp:~vila/bzr/gdfo-heads
Merge into: lp:~bzr/bzr/trunk-old
Diff against target: 1180 lines
To merge this branch: bzr merge lp:~vila/bzr/gdfo-heads
Related bugs:
Reviewer: John A Meinel (status: Needs Fixing)
Review via email: mp+7651@code.launchpad.net
Commit message
Description of the change
Vincent Ladeuil (vila) wrote:
John A Meinel (jameinel) wrote:
Vincent Ladeuil wrote:
> Vincent Ladeuil has proposed merging lp:~vila/bzr/gdfo-heads into lp:bzr.
>
> Requested reviews:
> bzr-core (bzr-core)
>
> This is a new implementation of KnownGraph.heads() and a companion _find_gdfo().
...
>
> Climbing on John's shoulders (all the tests and the pyrex plumbing
> were there, waiting ;-P) I was able to improve the performance for
> "dense" graphs (think mysql). The performance for bzr improves
> too, but in a less spectacular way.
>
> This patch also includes a fix by John for his implementation
> that also gives some spectacular results; keep that in mind when
> reading the numbers.
Well, you were the one who realized we could stop at "min_gdfo", which
was the big win. I just implemented it in the existing pyrex code.
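[For readers following along, the "stop at min_gdfo" idea can be sketched in plain Python over a plain {key: parent_keys} dict, with gdfo values assumed precomputed by a _find_gdfo-style pass. This is a minimal illustration with made-up sample data, not the patch itself:]

```python
def heads(parent_map, gdfo, keys):
    """Return the heads among keys, pruning the walk with gdfo.

    A key reachable from another key is not a head. Since an ancestor
    always has a strictly smaller gdfo than any of its descendants, the
    walk can stop expanding as soon as it reaches a node whose gdfo is
    at or below the smallest candidate gdfo: nothing beyond it can be
    one of the candidates.
    """
    candidates = frozenset(keys)
    min_gdfo = min(gdfo[k] for k in candidates)
    seen = set()
    pending = [p for k in candidates for p in parent_map.get(k, ())]
    while pending:
        key = pending.pop()
        if key in seen:
            continue
        seen.add(key)
        if gdfo[key] <= min_gdfo:
            # Too shallow to have any candidate as an ancestor: stop here
            continue
        pending.extend(parent_map.get(key, ()))
    # Any candidate we walked over is somebody's ancestor, hence not a head
    return candidates.difference(seen)

# Illustrative graph: A is the root, D merges B and C
pm = {'A': (), 'B': ('A',), 'C': ('A',), 'D': ('B', 'C')}
g = {'A': 1, 'B': 2, 'C': 2, 'D': 3}
```
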
>
> Below are some results for 1.16, bzr.dev@4459, john's version
> (aka an unreleased one, see
> revid:<email address hidden> in
> this branch) and this proposed version (which supersedes John's
> one).
>
> I've used NEWS for bzr as a reference point in the bzr.dev@4459
> tree, and sql/mysqld.cc in dev6 conversion of mysql-6.0@2791.
>
> I've also used tools/time_graph.py to measure the performance
> more precisely.
Actually, time_graph measures something different. Specifically, it
measures interesting nodes in the ancestry of a branch (merge parents),
while annotate is doing things in a per-file graph.
Both are potentially useful, but they really aren't measuring the same
thing.
Since you are no longer using heapq, you can actually improve the pyrex
quite a bit with:
cdef public long gdfo # Int
Pyrex will automatically wrap a long with the appropriate to/from PyLong
wrappers (actually Python does it when you declare the class object).
And at that point you get to access gdfo as a raw C long, rather than
casting to/from PyInt. (With the downside that you can't handle graphs
larger than 2^31, which I'm not worried about.)
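[A sketch of that suggestion, in Pyrex/Cython syntax; only the `gdfo` declaration is the point here, the surrounding fields are copied from the patch below for context:]

```
cdef class _KnownGraphNode:

    cdef object key
    cdef object parents
    cdef object children
    # Declared as a C long: Pyrex generates the PyInt conversion at the
    # Python attribute boundary, while code inside the extension type
    # reads and compares gdfo as a raw C long, with no boxing. The
    # trade-off, as noted above, is a 2^31 ceiling with 32-bit longs.
    cdef public long gdfo
```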
In general, I'm pretty sure the pyrex version is almost identical to the
python version, and I would expect it to not really be any faster. (You
aren't directly accessing List items, etc. And in fact, you deleted my
helpers for doing so.)
I also wasn't actually planning on including 'time_graph.py' in bzr.dev,
that was more of an accident.
Anyway, the algorithm changes are nice, and if you want I'll take some
time to profile and improve the pyrex code.
(this: python 0.960s, pyrex 0.540s. I guess the pyrex is slightly
faster, but I was seeing a lot more than this in the past...)
John
=:->
So I'll take over some Pyrex perf tomorrow. Otherwise I certainly approve.
review: needs_fixing
Vincent Ladeuil (vila) wrote:
>>>>> "jam" == John A Meinel <email address hidden> writes:
<snip/>
jam> Well, you were the one that realized we could stop at
jam> "min_gdfo" which was the big win.
Right, having ideas doesn't cost much ;-)
jam> I just implemented it the existing pyrex code.
And rightly deserve credit for that ;-)
<snip/>
jam> Actually, time_graph measures something
jam> different. Specifically, it measures interesting nodes
jam> in the ancestry of a branch (merge parents), while
jam> annotate is doing things in a per-file graph.
I think they are an easy-to-find, yet pretty representative, sample of heads() calls.
jam> Both are potentially useful, but they really aren't
jam> measuring the same thing.
I understand that, but I thought they were valid enough to
compare different implementations.
jam> since you are no longer using heapq you can actually improve the pyrex
jam> quite a bit with:
jam> cdef public long gdfo # Int
jam> Pyrex will automatically wrap a long with the
jam> appropriate to/from PyLong wrappers (actually Python
jam> does it when you declare the class object). And at that
jam> point you get to access gdfo as a raw C long, rather
jam> than casting to/from PyInt. (With the downside that you
jam> can't handle graphs larger than 2^31, which I'm not
jam> worried about.)
AIUI, the potential meat when using pyrex is to avoid conversions
to/from python objects.
As such, I'm damn *sure* we can do far better.
Yet, I still feel it's a bit premature to do it right now
(*especially* considering I deleted the linear dominators related
code, which I still think will be useful).
jam> In general, I'm pretty sure the pyrex version is almost
jam> identical to the python version, and I would expect it
jam> to not really be any faster. (You aren't directly
jam> accessing List items, etc. And in fact, you deleted my
jam> helpers for doing so.)
Yes, I tried to keep both implementations as close as possible.
jam> I also wasn't actually planning on including
jam> 'time_graph.py' in bzr.dev, that was more of an
jam> accident.
Keep having accidents like that, time_graph was a key part of
that experiment!
hack && make && selftest && time_graph && rinse && repeat
jam> Anyway, the algorithm changes are nice, and if you want
jam> I'll take some time to profile and improve the pyrex
jam> code.
As said above, I don't want to optimize prematurely.
I'd like to see this landed as it gives a nice improvement in
every case (unless someone can prove us wrong here).
But it also shows (and your experiment more than mine) that micro
tuning has potential, at the cost of making the code hard to read
(I may change my mind by practicing pyrex a bit more, but that's
how I feel so far).
The next steps I have in mind are:
First, I'd like to apply the KnownGraph strategy to the revno
computation. That will still be an O(history) operation, but if we
can make it less crazily long...
From there, we can evaluate caching gdfo (ghost filling problems
yet to be addressed) and make these algorithms work with lazily
loaded graphs.
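[The gdfo pass being discussed can be sketched in plain Python. This is a simplification that assumes a ghost-free parent_map; the patch itself also treats ghosts (parents absent from parent_map) as tails:]

```python
def find_gdfo(parent_map):
    """Compute gdfo (greatest distance from origin) for every node.

    Tails (nodes without parents) get gdfo=1. A child is queued only
    once its last parent has been processed, so every node is handled
    exactly once and no heap/priority queue is needed.
    """
    # Reverse the parent map into a child map
    children = {key: [] for key in parent_map}
    for key, parents in parent_map.items():
        for parent in parents:
            children[parent].append(key)
    gdfo = {}
    pending = []
    for key, parents in parent_map.items():
        if not parents:
            gdfo[key] = 1
            pending.append(key)
    known_parent_gdfos = {}
    while pending:
        key = pending.pop()
        for child in children[key]:
            known_parent_gdfos[child] = known_parent_gdfos.get(child, 0) + 1
            if child not in gdfo or gdfo[key] + 1 > gdfo[child]:
                gdfo[child] = gdfo[key] + 1
            if known_parent_gdfos[child] == len(parent_map[child]):
                # We are the last parent updating that node: its gdfo is
                # final, continue the walk from there
                pending.append(child)
    return gdfo
```
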
jam> (this: pytho...
Vincent Ladeuil (vila) wrote:
>>>>> "vila" == Vincent Ladeuil <email address hidden> writes:
<snip/>
jam> And in fact, you deleted my helpers for doing so.)
I failed to mention that my intent was to keep the code as
minimal as possible. I value these helpers, know where they are,
and can restore them the day I need them.
Preview Diff
1 | === modified file 'bzrlib/_known_graph_py.py' |
2 | --- bzrlib/_known_graph_py.py 2009-06-16 15:35:14 +0000 |
3 | +++ bzrlib/_known_graph_py.py 2009-06-19 02:35:18 +0000 |
4 | @@ -17,8 +17,6 @@ |
5 | """Implementation of Graph algorithms when we have already loaded everything. |
6 | """ |
7 | |
8 | -import heapq |
9 | - |
10 | from bzrlib import ( |
11 | revision, |
12 | ) |
13 | @@ -27,27 +25,19 @@ |
14 | class _KnownGraphNode(object): |
15 | """Represents a single object in the known graph.""" |
16 | |
17 | - __slots__ = ('key', 'parent_keys', 'child_keys', 'linear_dominator', |
18 | - 'gdfo', 'ancestor_of') |
19 | + __slots__ = ('key', 'parent_keys', 'child_keys', 'gdfo') |
20 | |
21 | def __init__(self, key, parent_keys): |
22 | self.key = key |
23 | self.parent_keys = parent_keys |
24 | self.child_keys = [] |
25 | - # oldest ancestor, such that no parents between here and there have >1 |
26 | - # child or >1 parent. |
27 | - self.linear_dominator = None |
28 | # Greatest distance from origin |
29 | self.gdfo = None |
30 | - # This will become a tuple of known heads that have this node as an |
31 | - # ancestor |
32 | - self.ancestor_of = None |
33 | |
34 | def __repr__(self): |
35 | - return '%s(%s gdfo:%s par:%s child:%s %s)' % ( |
36 | + return '%s(%s gdfo:%s par:%s child:%s)' % ( |
37 | self.__class__.__name__, self.key, self.gdfo, |
38 | - self.parent_keys, self.child_keys, |
39 | - self.linear_dominator) |
40 | + self.parent_keys, self.child_keys) |
41 | |
42 | |
43 | class KnownGraph(object): |
44 | @@ -63,21 +53,28 @@ |
45 | self._known_heads = {} |
46 | self.do_cache = do_cache |
47 | self._initialize_nodes(parent_map) |
48 | - self._find_linear_dominators() |
49 | self._find_gdfo() |
50 | |
51 | def _initialize_nodes(self, parent_map): |
52 | """Populate self._nodes. |
53 | |
54 | - After this has finished, self._nodes will have an entry for every entry |
55 | - in parent_map. Ghosts will have a parent_keys = None, all nodes found |
56 | - will also have .child_keys populated with all known child_keys. |
57 | + After this has finished: |
58 | + - self._nodes will have an entry for every entry in parent_map. |
59 | + - ghosts will have a parent_keys = None, |
60 | + - all nodes found will also have .child_keys populated with all known |
61 | + child_keys, |
62 | + - self._tails will list all the nodes without parents. |
63 | """ |
64 | + tails = self._tails = set() |
65 | nodes = self._nodes |
66 | for key, parent_keys in parent_map.iteritems(): |
67 | if key in nodes: |
68 | node = nodes[key] |
69 | node.parent_keys = parent_keys |
70 | + if parent_keys: |
71 | + # This node has been added before being seen in parent_map |
72 | + # (see below) |
73 | + tails.remove(node) |
74 | else: |
75 | node = _KnownGraphNode(key, parent_keys) |
76 | nodes[key] = node |
77 | @@ -87,129 +84,34 @@ |
78 | except KeyError: |
79 | parent_node = _KnownGraphNode(parent_key, None) |
80 | nodes[parent_key] = parent_node |
81 | + # Potentially a tail, if we're wrong we'll remove it later |
82 | + # (see above) |
83 | + tails.add(parent_node) |
84 | parent_node.child_keys.append(key) |
85 | |
86 | - def _find_linear_dominators(self): |
87 | - """For each node in the set, find any linear dominators. |
88 | - |
89 | - For any given node, the 'linear dominator' is an ancestor, such that |
90 | - all parents between this node and that one have a single parent, and a |
91 | - single child. So if A->B->C->D then B,C,D all have a linear dominator |
92 | - of A. |
93 | - |
94 | - There are two main benefits: |
95 | - 1) When walking the graph, we can jump to the nearest linear dominator, |
96 | - rather than walking all of the nodes inbetween. |
97 | - 2) When caching heads() results, dominators give the "same" results as |
98 | - their children. (If the dominator is a head, then the descendant is |
99 | - a head, if the dominator is not a head, then the child isn't |
100 | - either.) |
101 | - """ |
102 | - def check_node(node): |
103 | - if node.parent_keys is None or len(node.parent_keys) != 1: |
104 | - # This node is either a ghost, a tail, or has multiple parents |
105 | - # It its own dominator |
106 | - node.linear_dominator = node.key |
107 | - return None |
108 | - parent_node = self._nodes[node.parent_keys[0]] |
109 | - if len(parent_node.child_keys) > 1: |
110 | - # The parent has multiple children, so *this* node is the |
111 | - # dominator |
112 | - node.linear_dominator = node.key |
113 | - return None |
114 | - # The parent is already filled in, so add and continue |
115 | - if parent_node.linear_dominator is not None: |
116 | - node.linear_dominator = parent_node.linear_dominator |
117 | - return None |
118 | - # We don't know this node, or its parent node, so start walking to |
119 | - # next |
120 | - return parent_node |
121 | - |
122 | - for node in self._nodes.itervalues(): |
123 | - # The parent is not filled in, so walk until we get somewhere |
124 | - if node.linear_dominator is not None: #already done |
125 | - continue |
126 | - next_node = check_node(node) |
127 | - if next_node is None: |
128 | - # Nothing more needs to be done |
129 | - continue |
130 | - stack = [] |
131 | - while next_node is not None: |
132 | - stack.append(node) |
133 | - node = next_node |
134 | - next_node = check_node(node) |
135 | - # The stack now contains the linear chain, and 'node' should have |
136 | - # been labeled |
137 | - dominator = node.linear_dominator |
138 | - while stack: |
139 | - next_node = stack.pop() |
140 | - next_node.linear_dominator = dominator |
141 | - node = next_node |
142 | - |
143 | def _find_gdfo(self): |
144 | - def find_tails(): |
145 | - return [node for node in self._nodes.itervalues() |
146 | - if not node.parent_keys] |
147 | - tails = find_tails() |
148 | - todo = [] |
149 | - heappush = heapq.heappush |
150 | - heappop = heapq.heappop |
151 | nodes = self._nodes |
152 | - for node in tails: |
153 | + known_parent_gdfos = {} |
154 | + pending = [] |
155 | + |
156 | + for node in self._tails: |
157 | node.gdfo = 1 |
158 | - heappush(todo, (1, node)) |
159 | - processed = 0 |
160 | - while todo: |
161 | - gdfo, next = heappop(todo) |
162 | - processed += 1 |
163 | - if next.gdfo is not None and gdfo < next.gdfo: |
164 | - # This node was reached from a longer path, we assume it was |
165 | - # enqued correctly with the longer gdfo, so don't continue |
166 | - # processing now |
167 | - continue |
168 | - next_gdfo = gdfo + 1 |
169 | - for child_key in next.child_keys: |
170 | - child_node = nodes[child_key] |
171 | - if child_node.gdfo is None or child_node.gdfo < next_gdfo: |
172 | - # Only enque children when all of their parents have been |
173 | - # resolved |
174 | - for parent_key in child_node.parent_keys: |
175 | - # We know that 'this' parent is counted |
176 | - if parent_key != next.key: |
177 | - parent_node = nodes[parent_key] |
178 | - if parent_node.gdfo is None: |
179 | - break |
180 | - else: |
181 | - child_node.gdfo = next_gdfo |
182 | - heappush(todo, (next_gdfo, child_node)) |
183 | - |
184 | - def _get_dominators_to_nodes(self, candidate_nodes): |
185 | - """Get the reverse mapping from dominator_key => candidate_nodes. |
186 | - |
187 | - As a side effect, this can also remove potential candidate nodes if we |
188 | - determine that they share a dominator. |
189 | - """ |
190 | - dom_to_node = {} |
191 | - keys_to_remove = [] |
192 | - for node in candidate_nodes.values(): |
193 | - if node.linear_dominator in dom_to_node: |
194 | - # This node already exists, resolve which node supersedes the |
195 | - # other |
196 | - other_node = dom_to_node[node.linear_dominator] |
197 | - # There should be no way that nodes sharing a dominator could |
198 | - # 'tie' for gdfo |
199 | - if other_node.gdfo > node.gdfo: |
200 | - # The other node has this node as an ancestor |
201 | - keys_to_remove.append(node.key) |
202 | - else: |
203 | - # Replace the other node, and set this as the new key |
204 | - keys_to_remove.append(other_node.key) |
205 | - dom_to_node[node.linear_dominator] = node |
206 | - else: |
207 | - dom_to_node[node.linear_dominator] = node |
208 | - for key in keys_to_remove: |
209 | - candidate_nodes.pop(key) |
210 | - return dom_to_node |
211 | + pending.append(node) |
212 | + |
213 | + while pending: |
214 | + node = pending.pop() |
215 | + for child_key in node.child_keys: |
216 | + child = nodes[child_key] |
217 | + try: |
218 | + known_parent_gdfos[child_key] += 1 |
219 | + except KeyError: |
220 | + known_parent_gdfos[child_key] = 1 |
221 | + if child.gdfo is None or node.gdfo + 1 > child.gdfo: |
222 | + child.gdfo = node.gdfo + 1 |
223 | + if known_parent_gdfos[child_key] == len(child.parent_keys): |
224 | + # We are the last parent updating that node, we can |
225 | + # continue from there |
226 | + pending.append(child) |
227 | |
228 | def heads(self, keys): |
229 | """Return the heads from amongst keys. |
230 | @@ -217,9 +119,8 @@ |
231 | This is done by searching the ancestries of each key. Any key that is |
232 | reachable from another key is not returned; all the others are. |
233 | |
234 | - This operation scales with the relative depth between any two keys. If |
235 | - any two keys are completely disconnected all ancestry of both sides |
236 | - will be retrieved. |
237 | + This operation scales with the relative depth between any two keys. It |
238 | + uses gdfo to avoid walking all ancestry. |
239 | |
240 | :param keys: An iterable of keys. |
241 | :return: A set of the heads. Note that as a set there is no ordering |
242 | @@ -231,114 +132,43 @@ |
243 | # NULL_REVISION is only a head if it is the only entry |
244 | candidate_nodes.pop(revision.NULL_REVISION) |
245 | if not candidate_nodes: |
246 | - return set([revision.NULL_REVISION]) |
247 | + return frozenset([revision.NULL_REVISION]) |
248 | if len(candidate_nodes) < 2: |
249 | + # No or only one candidate |
250 | return frozenset(candidate_nodes) |
251 | heads_key = frozenset(candidate_nodes) |
252 | if heads_key != frozenset(keys): |
253 | + # Mention duplicates |
254 | note('%s != %s', heads_key, frozenset(keys)) |
255 | + # Do we have a cached result ? |
256 | try: |
257 | heads = self._known_heads[heads_key] |
258 | return heads |
259 | except KeyError: |
260 | - pass # compute it ourselves |
261 | - dom_to_node = self._get_dominators_to_nodes(candidate_nodes) |
262 | - if len(candidate_nodes) < 2: |
263 | - # We shrunk candidate_nodes and determined a new head |
264 | - return frozenset(candidate_nodes) |
265 | - dom_heads_key = None |
266 | - # Check the linear dominators of these keys, to see if we already |
267 | - # know the heads answer |
268 | - dom_heads_key = frozenset([node.linear_dominator |
269 | - for node in candidate_nodes.itervalues()]) |
270 | - if dom_heads_key in self._known_heads: |
271 | - # map back into the original keys |
272 | - heads = self._known_heads[dom_heads_key] |
273 | - heads = frozenset([dom_to_node[key].key for key in heads]) |
274 | - return heads |
275 | - heads = self._heads_from_candidate_nodes(candidate_nodes, dom_to_node) |
276 | + pass |
277 | + # Let's compute the heads |
278 | + seen = set() |
279 | + pending = [] |
280 | + min_gdfo = None |
281 | + for node in candidate_nodes.values(): |
282 | + if node.parent_keys: |
283 | + pending.extend(node.parent_keys) |
284 | + if min_gdfo is None or node.gdfo < min_gdfo: |
285 | + min_gdfo = node.gdfo |
286 | + nodes = self._nodes |
287 | + while pending: |
288 | + node_key = pending.pop() |
289 | + if node_key in seen: |
290 | + # node already appears in some ancestry |
291 | + continue |
292 | + seen.add(node_key) |
293 | + node = nodes[node_key] |
294 | + if node.gdfo <= min_gdfo: |
295 | + continue |
296 | + if node.parent_keys: |
297 | + pending.extend(node.parent_keys) |
298 | + heads = heads_key.difference(seen) |
299 | if self.do_cache: |
300 | self._known_heads[heads_key] = heads |
301 | - # Cache the dominator heads |
302 | - if dom_heads_key is not None: |
303 | - dom_heads = frozenset([candidate_nodes[key].linear_dominator |
304 | - for key in heads]) |
305 | - self._known_heads[dom_heads_key] = dom_heads |
306 | return heads |
307 | |
308 | - def _heads_from_candidate_nodes(self, candidate_nodes, dom_to_node): |
309 | - queue = [] |
310 | - to_cleanup = [] |
311 | - to_cleanup_append = to_cleanup.append |
312 | - for node in candidate_nodes.itervalues(): |
313 | - node.ancestor_of = (node.key,) |
314 | - queue.append((-node.gdfo, node)) |
315 | - to_cleanup_append(node) |
316 | - heapq.heapify(queue) |
317 | - # These are nodes that we determined are 'common' that we are no longer |
318 | - # walking |
319 | - # Now we walk nodes until all nodes that are being walked are 'common' |
320 | - num_candidates = len(candidate_nodes) |
321 | - nodes = self._nodes |
322 | - heappop = heapq.heappop |
323 | - heappush = heapq.heappush |
324 | - while queue and len(candidate_nodes) > 1: |
325 | - _, node = heappop(queue) |
326 | - next_ancestor_of = node.ancestor_of |
327 | - if len(next_ancestor_of) == num_candidates: |
328 | - # This node is now considered 'common' |
329 | - # Make sure all parent nodes are marked as such |
330 | - for parent_key in node.parent_keys: |
331 | - parent_node = nodes[parent_key] |
332 | - if parent_node.ancestor_of is not None: |
333 | - parent_node.ancestor_of = next_ancestor_of |
334 | - if node.linear_dominator != node.key: |
335 | - parent_node = nodes[node.linear_dominator] |
336 | - if parent_node.ancestor_of is not None: |
337 | - parent_node.ancestor_of = next_ancestor_of |
338 | - continue |
339 | - if node.parent_keys is None: |
340 | - # This is a ghost |
341 | - continue |
342 | - # Now project the current nodes ancestor list to the parent nodes, |
343 | - # and queue them up to be walked |
344 | - # Note: using linear_dominator speeds things up quite a bit |
345 | - # enough that we actually start to be slightly faster |
346 | - # than the default heads() implementation |
347 | - if node.linear_dominator != node.key: |
348 | - # We are at the tip of a long linear region |
349 | - # We know that there is nothing between here and the tail |
350 | - # that is interesting, so skip to the end |
351 | - parent_keys = [node.linear_dominator] |
352 | - else: |
353 | - parent_keys = node.parent_keys |
354 | - for parent_key in parent_keys: |
355 | - if parent_key in candidate_nodes: |
356 | - candidate_nodes.pop(parent_key) |
357 | - if len(candidate_nodes) <= 1: |
358 | - break |
359 | - elif parent_key in dom_to_node: |
360 | - orig_node = dom_to_node[parent_key] |
361 | - if orig_node is not node: |
362 | - if orig_node.key in candidate_nodes: |
363 | - candidate_nodes.pop(orig_node.key) |
364 | - if len(candidate_nodes) <= 1: |
365 | - break |
366 | - parent_node = nodes[parent_key] |
367 | - ancestor_of = parent_node.ancestor_of |
368 | - if ancestor_of is None: |
369 | - # This node hasn't been walked yet |
370 | - parent_node.ancestor_of = next_ancestor_of |
371 | - # Enqueue this node |
372 | - heappush(queue, (-parent_node.gdfo, parent_node)) |
373 | - to_cleanup_append(parent_node) |
374 | - elif ancestor_of != next_ancestor_of: |
375 | - # Combine to get the full set of parents |
376 | - all_ancestors = set(ancestor_of) |
377 | - all_ancestors.update(next_ancestor_of) |
378 | - parent_node.ancestor_of = tuple(sorted(all_ancestors)) |
379 | - def cleanup(): |
380 | - for node in to_cleanup: |
381 | - node.ancestor_of = None |
382 | - cleanup() |
383 | - return frozenset(candidate_nodes) |
384 | |
385 | === modified file 'bzrlib/_known_graph_pyx.pyx' |
386 | --- bzrlib/_known_graph_pyx.pyx 2009-06-16 15:35:14 +0000 |
387 | +++ bzrlib/_known_graph_pyx.pyx 2009-06-19 02:35:18 +0000 |
388 | @@ -42,28 +42,15 @@ |
389 | void Py_INCREF(object) |
390 | |
391 | |
392 | -import heapq |
393 | - |
394 | from bzrlib import revision |
395 | |
396 | -# Define these as cdef objects, so we don't have to getattr them later |
397 | -cdef object heappush, heappop, heapify, heapreplace |
398 | -heappush = heapq.heappush |
399 | -heappop = heapq.heappop |
400 | -heapify = heapq.heapify |
401 | -heapreplace = heapq.heapreplace |
402 | - |
403 | - |
404 | cdef class _KnownGraphNode: |
405 | """Represents a single object in the known graph.""" |
406 | |
407 | cdef object key |
408 | cdef object parents |
409 | cdef object children |
410 | - cdef _KnownGraphNode linear_dominator_node |
411 | cdef public object gdfo # Int |
412 | - # This could also be simplified |
413 | - cdef object ancestor_of |
414 | |
415 | def __init__(self, key): |
416 | cdef int i |
417 | @@ -72,14 +59,8 @@ |
418 | self.parents = None |
419 | |
420 | self.children = [] |
421 | - # oldest ancestor, such that no parents between here and there have >1 |
422 | - # child or >1 parent. |
423 | - self.linear_dominator_node = None |
424 | # Greatest distance from origin |
425 | self.gdfo = -1 |
426 | - # This will become a tuple of known heads that have this node as an |
427 | - # ancestor |
428 | - self.ancestor_of = None |
429 | |
430 | property child_keys: |
431 | def __get__(self): |
432 | @@ -90,17 +71,9 @@ |
433 | PyList_Append(keys, child.key) |
434 | return keys |
435 | |
436 | - property linear_dominator: |
437 | - def __get__(self): |
438 | - if self.linear_dominator_node is None: |
439 | - return None |
440 | - else: |
441 | - return self.linear_dominator_node.key |
442 | - |
443 | cdef clear_references(self): |
444 | self.parents = None |
445 | self.children = None |
446 | - self.linear_dominator_node = None |
447 | |
448 | def __repr__(self): |
449 | cdef _KnownGraphNode node |
450 | @@ -113,34 +86,10 @@ |
451 | if self.children is not None: |
452 | for node in self.children: |
453 | child_keys.append(node.key) |
454 | - return '%s(%s gdfo:%s par:%s child:%s %s)' % ( |
455 | + return '%s(%s gdfo:%s par:%s child:%s)' % ( |
456 | self.__class__.__name__, self.key, self.gdfo, |
457 | - parent_keys, child_keys, |
458 | - self.linear_dominator) |
459 | - |
460 | - |
461 | -cdef _KnownGraphNode _get_list_node(lst, Py_ssize_t pos): |
462 | - cdef PyObject *temp_node |
463 | - |
464 | - temp_node = PyList_GET_ITEM(lst, pos) |
465 | - return <_KnownGraphNode>temp_node |
466 | - |
467 | - |
468 | -cdef _KnownGraphNode _get_parent(parents, Py_ssize_t pos): |
469 | - cdef PyObject *temp_node |
470 | - cdef _KnownGraphNode node |
471 | - |
472 | - temp_node = PyTuple_GET_ITEM(parents, pos) |
473 | - return <_KnownGraphNode>temp_node |
474 | - |
475 | - |
476 | -cdef _KnownGraphNode _peek_node(queue): |
477 | - cdef PyObject *temp_node |
478 | - cdef _KnownGraphNode node |
479 | - |
480 | - temp_node = PyTuple_GET_ITEM(<object>PyList_GET_ITEM(queue, 0), 1) |
481 | - node = <_KnownGraphNode>temp_node |
482 | - return node |
483 | + parent_keys, child_keys) |
484 | + |
485 | |
486 | # TODO: slab allocate all _KnownGraphNode objects. |
487 | # We already know how many we are going to need, except for a couple of |
488 | @@ -150,11 +99,9 @@ |
489 | """This is a class which assumes we already know the full graph.""" |
490 | |
491 | cdef public object _nodes |
492 | + cdef public object _tails |
493 | cdef object _known_heads |
494 | cdef public int do_cache |
495 | - # Nodes we've touched that we'll need to reset their info when heads() is |
496 | - # done |
497 | - cdef object _to_cleanup |
498 | |
499 | def __init__(self, parent_map, do_cache=True): |
500 | """Create a new KnownGraph instance. |
501 | @@ -164,10 +111,8 @@ |
502 | self._nodes = {} |
503 | # Maps {sorted(revision_id, revision_id): heads} |
504 | self._known_heads = {} |
505 | - self._to_cleanup = [] |
506 | self.do_cache = int(do_cache) |
507 | self._initialize_nodes(parent_map) |
508 | - self._find_linear_dominators() |
509 | self._find_gdfo() |
510 | |
511 | def __dealloc__(self): |
512 | @@ -179,7 +124,7 @@ |
513 | child = <_KnownGraphNode>temp_node |
514 | child.clear_references() |
515 | |
516 | - cdef _KnownGraphNode _get_or_create_node(self, key): |
517 | + cdef _KnownGraphNode _get_or_create_node(self, key, int *created): |
518 | cdef PyObject *temp_node |
519 | cdef _KnownGraphNode node |
520 | |
521 | @@ -187,22 +132,29 @@ |
522 | if temp_node == NULL: |
523 | node = _KnownGraphNode(key) |
524 | PyDict_SetItem(self._nodes, key, node) |
525 | + created[0] = 1 # True |
526 | else: |
527 | node = <_KnownGraphNode>temp_node |
528 | + created[0] = 0 # False |
529 | return node |
530 | |
531 | def _initialize_nodes(self, parent_map): |
532 | """Populate self._nodes. |
533 | |
534 | - After this has finished, self._nodes will have an entry for every entry |
535 | - in parent_map. Ghosts will have a parent_keys = None, all nodes found |
536 | - will also have .child_keys populated with all known child_keys. |
537 | + After this has finished: |
538 | + - self._nodes will have an entry for every entry in parent_map. |
539 | + - ghosts will have a parent_keys = None, |
540 | + - all nodes found will also have .child_keys populated with all known |
541 | + child_keys, |
542 | + - self._tails will list all the nodes without parents. |
543 | """ |
544 | cdef PyObject *temp_key, *temp_parent_keys, *temp_node |
545 | cdef Py_ssize_t pos, pos2, num_parent_keys |
546 | cdef _KnownGraphNode node |
547 | cdef _KnownGraphNode parent_node |
548 | + cdef int created |
549 | |
550 | + tails = self._tails = set() |
551 | nodes = self._nodes |
552 | |
553 | if not PyDict_CheckExact(parent_map): |
554 | @@ -212,151 +164,56 @@ |
555 | while PyDict_Next(parent_map, &pos, &temp_key, &temp_parent_keys): |
556 | key = <object>temp_key |
557 | parent_keys = <object>temp_parent_keys |
558 | - node = self._get_or_create_node(key) |
559 | + num_parent_keys = len(parent_keys) |
560 | + node = self._get_or_create_node(key, &created) |
561 | + if not created and num_parent_keys != 0: |
562 | + # This node has been added before being seen in parent_map (see |
563 | + # below) |
564 | + tails.remove(node) |
565 | # We know how many parents, so we could pre allocate an exact sized |
566 | # tuple here |
567 | - num_parent_keys = len(parent_keys) |
568 | parent_nodes = PyTuple_New(num_parent_keys) |
569 | # We use iter here, because parent_keys maybe be a list or tuple |
570 | for pos2 from 0 <= pos2 < num_parent_keys: |
571 | - parent_key = parent_keys[pos2] |
572 | - parent_node = self._get_or_create_node(parent_keys[pos2]) |
573 | + parent_node = self._get_or_create_node(parent_keys[pos2], |
574 | + &created) |
575 | + if created: |
576 | + # Potentially a tail, if we're wrong we'll remove it later |
577 | + # (see above) |
578 | + tails.add(parent_node) |
579 | # PyTuple_SET_ITEM will steal a reference, so INCREF first |
580 | Py_INCREF(parent_node) |
581 | PyTuple_SET_ITEM(parent_nodes, pos2, parent_node) |
582 | PyList_Append(parent_node.children, node) |
583 | node.parents = parent_nodes |
584 | |
585 | - cdef _KnownGraphNode _check_is_linear(self, _KnownGraphNode node): |
586 | - """Check to see if a given node is part of a linear chain.""" |
587 | - cdef _KnownGraphNode parent_node |
588 | - if node.parents is None or PyTuple_GET_SIZE(node.parents) != 1: |
589 | - # This node is either a ghost, a tail, or has multiple parents |
590 | - # It its own dominator |
591 | - node.linear_dominator_node = node |
592 | - return None |
593 | - parent_node = _get_parent(node.parents, 0) |
594 | - if PyList_GET_SIZE(parent_node.children) > 1: |
595 | - # The parent has multiple children, so *this* node is the |
596 | - # dominator |
597 | - node.linear_dominator_node = node |
598 | - return None |
599 | - # The parent is already filled in, so add and continue |
600 | - if parent_node.linear_dominator_node is not None: |
601 | - node.linear_dominator_node = parent_node.linear_dominator_node |
602 | - return None |
603 | - # We don't know this node, or its parent node, so start walking to |
604 | - # next |
605 | - return parent_node |
606 | - |
607 | - def _find_linear_dominators(self): |
608 | - """ |
609 | - For any given node, the 'linear dominator' is an ancestor, such that |
610 | - all parents between this node and that one have a single parent, and a |
611 | - single child. So if A->B->C->D then B,C,D all have a linear dominator |
612 | - of A. |
613 | - |
614 | - There are two main benefits: |
615 | - 1) When walking the graph, we can jump to the nearest linear dominator, |
616 | - rather than walking all of the nodes inbetween. |
617 | - 2) When caching heads() results, dominators give the "same" results as |
618 | - their children. (If the dominator is a head, then the descendant is |
619 | - a head, if the dominator is not a head, then the child isn't |
620 | - either.) |
621 | - """ |
622 | - cdef PyObject *temp_node |
623 | - cdef Py_ssize_t pos |
624 | - cdef _KnownGraphNode node |
625 | - cdef _KnownGraphNode next_node |
626 | - cdef _KnownGraphNode dominator |
627 | - cdef int i, num_elements |
628 | - |
629 | - pos = 0 |
630 | - while PyDict_Next(self._nodes, &pos, NULL, &temp_node): |
631 | - node = <_KnownGraphNode>temp_node |
632 | - # The parent is not filled in, so walk until we get somewhere |
633 | - if node.linear_dominator_node is not None: #already done |
634 | - continue |
635 | - next_node = self._check_is_linear(node) |
636 | - if next_node is None: |
637 | - # Nothing more needs to be done |
638 | - continue |
639 | - stack = [] |
640 | - while next_node is not None: |
641 | - PyList_Append(stack, node) |
642 | - node = next_node |
643 | - next_node = self._check_is_linear(node) |
644 | - # The stack now contains the linear chain, and 'node' should have |
645 | - # been labeled |
646 | - dominator = node.linear_dominator_node |
647 | - num_elements = len(stack) |
648 | - for i from num_elements > i >= 0: |
649 | - next_node = _get_list_node(stack, i) |
650 | - next_node.linear_dominator_node = dominator |
651 | - node = next_node |
652 | - |
653 | - cdef object _find_tails(self): |
654 | - cdef object tails |
655 | - cdef PyObject *temp_node |
656 | - cdef Py_ssize_t pos |
657 | - cdef _KnownGraphNode node |
658 | - |
659 | - tails = [] |
660 | - pos = 0 |
661 | - while PyDict_Next(self._nodes, &pos, NULL, &temp_node): |
662 | - node = <_KnownGraphNode>temp_node |
663 | - if node.parents is None or PyTuple_GET_SIZE(node.parents) == 0: |
664 | - PyList_Append(tails, node) |
665 | - return tails |
666 | - |
667 | def _find_gdfo(self): |
668 | - cdef Py_ssize_t pos, pos2 |
669 | cdef _KnownGraphNode node |
670 | - cdef _KnownGraphNode child_node |
671 | - cdef _KnownGraphNode parent_node |
672 | - cdef int replace_node, missing_parent |
673 | - |
674 | - tails = self._find_tails() |
675 | - todo = [] |
676 | - for pos from 0 <= pos < PyList_GET_SIZE(tails): |
677 | - node = _get_list_node(tails, pos) |
678 | + cdef _KnownGraphNode child |
679 | + |
680 | + nodes = self._nodes |
681 | + pending = [] |
682 | + known_parent_gdfos = {} |
683 | + |
684 | + for node in self._tails: |
685 | node.gdfo = 1 |
686 | - PyList_Append(todo, (1, node)) |
687 | - # No need to heapify, because all tails have priority=1 |
688 | - while PyList_GET_SIZE(todo) > 0: |
689 | - node = _peek_node(todo) |
690 | - next_gdfo = node.gdfo + 1 |
691 | - replace_node = 1 |
692 | - for pos from 0 <= pos < PyList_GET_SIZE(node.children): |
693 | - child_node = _get_list_node(node.children, pos) |
694 | - # We should never have numbered children before we numbered |
695 | - # a parent |
696 | - if child_node.gdfo != -1: |
697 | - continue |
698 | - # Only enque children when all of their parents have been |
699 | - # resolved. With a single parent, we can just take 'this' value |
700 | - child_gdfo = next_gdfo |
701 | - if PyTuple_GET_SIZE(child_node.parents) > 1: |
702 | - missing_parent = 0 |
703 | - for pos2 from 0 <= pos2 < PyTuple_GET_SIZE(child_node.parents): |
704 | - parent_node = _get_parent(child_node.parents, pos2) |
705 | - if parent_node.gdfo == -1: |
706 | - missing_parent = 1 |
707 | - break |
708 | - if parent_node.gdfo >= child_gdfo: |
709 | - child_gdfo = parent_node.gdfo + 1 |
710 | - if missing_parent: |
711 | - # One of the parents is not numbered, so wait until we get |
712 | - # back here |
713 | - continue |
714 | - child_node.gdfo = child_gdfo |
715 | - if replace_node: |
716 | - heapreplace(todo, (child_gdfo, child_node)) |
717 | - replace_node = 0 |
718 | - else: |
719 | - heappush(todo, (child_gdfo, child_node)) |
720 | - if replace_node: |
721 | - heappop(todo) |
722 | + known_parent_gdfos[node] = 0 |
723 | + pending.append(node) |
724 | + |
725 | + while pending: |
726 | + node = <_KnownGraphNode>pending.pop() |
727 | + for child in node.children: |
728 | + try: |
729 | + known_parents = known_parent_gdfos[child.key] |
730 | + except KeyError: |
731 | + known_parents = 0 |
732 | + known_parent_gdfos[child.key] = known_parents + 1 |
733 | + if child.gdfo is None or node.gdfo + 1 > child.gdfo: |
734 | + child.gdfo = node.gdfo + 1 |
735 | + if known_parent_gdfos[child.key] == len(child.parents): |
736 | + # We are the last parent updating that node, we can |
737 | + # continue from there |
738 | + pending.append(child) |
739 | |
740 | def heads(self, keys): |
741 | """Return the heads from amongst keys. |
742 | @@ -364,9 +221,8 @@ |
743 | This is done by searching the ancestries of each key. Any key that is |
744 | reachable from another key is not returned; all the others are. |
745 | |
746 | - This operation scales with the relative depth between any two keys. If |
747 | - any two keys are completely disconnected all ancestry of both sides |
748 | - will be retrieved. |
749 | + This operation scales with the relative depth between any two keys. It |
750 | + uses gdfo to avoid walking all ancestry. |
751 | |
752 | :param keys: An iterable of keys. |
753 | :return: A set of the heads. Note that as a set there is no ordering |
754 | @@ -375,12 +231,13 @@ |
755 | """ |
756 | cdef PyObject *maybe_node |
757 | cdef PyObject *maybe_heads |
758 | + cdef PyObject *temp_node |
759 | + cdef _KnownGraphNode node |
760 | |
761 | heads_key = PyFrozenSet_New(keys) |
762 | maybe_heads = PyDict_GetItem(self._known_heads, heads_key) |
763 | if maybe_heads != NULL: |
764 | return <object>maybe_heads |
765 | - |
766 | # Not cached, compute it ourselves |
767 | candidate_nodes = {} |
768 | nodes = self._nodes |
769 | @@ -398,208 +255,30 @@ |
770 | heads_key = PyFrozenSet_New(candidate_nodes) |
771 | if len(candidate_nodes) < 2: |
772 | return heads_key |
773 | - dom_to_node = self._get_dominators_to_nodes(candidate_nodes) |
774 | - if PyDict_Size(candidate_nodes) < 2: |
775 | - return frozenset(candidate_nodes) |
776 | - dom_lookup_key, heads = self._heads_from_dominators(candidate_nodes, |
777 | - dom_to_node) |
778 | - if heads is not None: |
779 | - if self.do_cache: |
780 | - # This heads was not in the cache, or it would have been caught |
781 | - # earlier, but the dom head *was*, so do the simple cache |
782 | - PyDict_SetItem(self._known_heads, heads_key, heads) |
783 | - return heads |
784 | - heads = self._heads_from_candidate_nodes(candidate_nodes, dom_to_node) |
785 | + |
786 | + seen = set() |
787 | + pending = [] |
788 | + cdef Py_ssize_t pos |
789 | + pos = 0 |
790 | + min_gdfo = None |
791 | + while PyDict_Next(candidate_nodes, &pos, NULL, &temp_node): |
792 | + node = <_KnownGraphNode>temp_node |
793 | + if node.parents is not None: |
794 | + pending.extend(node.parents) |
795 | + if min_gdfo is None or node.gdfo < min_gdfo: |
796 | + min_gdfo = node.gdfo |
797 | + nodes = self._nodes |
798 | + while pending: |
799 | + node = pending.pop() |
800 | + if node.key in seen: |
801 | + # node already appears in some ancestry |
802 | + continue |
803 | + seen.add(node.key) |
804 | + if node.gdfo <= min_gdfo: |
805 | + continue |
806 | + if node.parents: |
807 | + pending.extend(node.parents) |
808 | + heads = heads_key.difference(seen) |
809 | if self.do_cache: |
810 | - self._cache_heads(heads, heads_key, dom_lookup_key, candidate_nodes) |
811 | + self._known_heads[heads_key] = heads |
812 | return heads |
813 | - |
814 | - cdef object _cache_heads(self, heads, heads_key, dom_lookup_key, |
815 | - candidate_nodes): |
816 | - cdef PyObject *maybe_node |
817 | - cdef _KnownGraphNode node |
818 | - |
819 | - PyDict_SetItem(self._known_heads, heads_key, heads) |
820 | - dom_heads = [] |
821 | - for key in heads: |
822 | - maybe_node = PyDict_GetItem(candidate_nodes, key) |
823 | - if maybe_node == NULL: |
824 | - raise KeyError |
825 | - node = <_KnownGraphNode>maybe_node |
826 | - PyList_Append(dom_heads, node.linear_dominator_node.key) |
827 | - PyDict_SetItem(self._known_heads, dom_lookup_key, |
828 | - PyFrozenSet_New(dom_heads)) |
829 | - |
830 | - cdef _get_dominators_to_nodes(self, candidate_nodes): |
831 | - """Get the reverse mapping from dominator_key => candidate_nodes. |
832 | - |
833 | - As a side effect, this can also remove potential candidate nodes if we |
834 | - determine that they share a dominator. |
835 | - """ |
836 | - cdef Py_ssize_t pos |
837 | - cdef _KnownGraphNode node, other_node |
838 | - cdef PyObject *temp_node |
839 | - cdef PyObject *maybe_node |
840 | - |
841 | - dom_to_node = {} |
842 | - keys_to_remove = [] |
843 | - pos = 0 |
844 | - while PyDict_Next(candidate_nodes, &pos, NULL, &temp_node): |
845 | - node = <_KnownGraphNode>temp_node |
846 | - dom_key = node.linear_dominator_node.key |
847 | - maybe_node = PyDict_GetItem(dom_to_node, dom_key) |
848 | - if maybe_node == NULL: |
849 | - PyDict_SetItem(dom_to_node, dom_key, node) |
850 | - else: |
851 | - other_node = <_KnownGraphNode>maybe_node |
852 | - # These nodes share a dominator, one of them obviously |
853 | - # supersedes the other, figure out which |
854 | - if other_node.gdfo > node.gdfo: |
855 | - PyList_Append(keys_to_remove, node.key) |
856 | - else: |
857 | - # This wins, replace the other |
858 | - PyList_Append(keys_to_remove, other_node.key) |
859 | - PyDict_SetItem(dom_to_node, dom_key, node) |
860 | - for pos from 0 <= pos < PyList_GET_SIZE(keys_to_remove): |
861 | - key = <object>PyList_GET_ITEM(keys_to_remove, pos) |
862 | - candidate_nodes.pop(key) |
863 | - return dom_to_node |
864 | - |
865 | - cdef object _heads_from_dominators(self, candidate_nodes, dom_to_node): |
866 | - cdef PyObject *maybe_heads |
867 | - cdef PyObject *maybe_node |
868 | - cdef _KnownGraphNode node |
869 | - cdef Py_ssize_t pos |
870 | - cdef PyObject *temp_node |
871 | - |
872 | - dom_list_key = [] |
873 | - pos = 0 |
874 | - while PyDict_Next(candidate_nodes, &pos, NULL, &temp_node): |
875 | - node = <_KnownGraphNode>temp_node |
876 | - PyList_Append(dom_list_key, node.linear_dominator_node.key) |
877 | - dom_lookup_key = PyFrozenSet_New(dom_list_key) |
878 | - maybe_heads = PyDict_GetItem(self._known_heads, dom_lookup_key) |
879 | - if maybe_heads == NULL: |
880 | - return dom_lookup_key, None |
881 | - # We need to map back from the dominator head to the original keys |
882 | - dom_heads = <object>maybe_heads |
883 | - heads = [] |
884 | - for dom_key in dom_heads: |
885 | - maybe_node = PyDict_GetItem(dom_to_node, dom_key) |
886 | - if maybe_node == NULL: |
887 | - # Should never happen |
888 | - raise KeyError |
889 | - node = <_KnownGraphNode>maybe_node |
890 | - PyList_Append(heads, node.key) |
891 | - return dom_lookup_key, PyFrozenSet_New(heads) |
892 | - |
893 | - cdef int _process_parent(self, _KnownGraphNode node, |
894 | - _KnownGraphNode parent_node, |
895 | - candidate_nodes, dom_to_node, |
896 | - queue, int *replace_item) except -1: |
897 | - """Process the parent of a node, seeing if we need to walk it.""" |
898 | - cdef PyObject *maybe_candidate |
899 | - cdef PyObject *maybe_node |
900 | - cdef _KnownGraphNode dom_child_node |
901 | - maybe_candidate = PyDict_GetItem(candidate_nodes, parent_node.key) |
902 | - if maybe_candidate != NULL: |
903 | - candidate_nodes.pop(parent_node.key) |
904 | - # We could pass up a flag that tells the caller to stop processing, |
905 | - # but it doesn't help much, and makes the code uglier |
906 | - return 0 |
907 | - maybe_node = PyDict_GetItem(dom_to_node, parent_node.key) |
908 | - if maybe_node != NULL: |
909 | - # This is a dominator of a node |
910 | - dom_child_node = <_KnownGraphNode>maybe_node |
911 | - if dom_child_node is not node: |
912 | - # It isn't a dominator of a node we are searching, so we should |
913 | - # remove it from the search |
914 | - maybe_candidate = PyDict_GetItem(candidate_nodes, dom_child_node.key) |
915 | - if maybe_candidate != NULL: |
916 | - candidate_nodes.pop(dom_child_node.key) |
917 | - return 0 |
918 | - if parent_node.ancestor_of is None: |
919 | - # This node hasn't been walked yet, so just project node's ancestor |
920 | - # info directly to parent_node, and enqueue it for later processing |
921 | - parent_node.ancestor_of = node.ancestor_of |
922 | - if replace_item[0]: |
923 | - heapreplace(queue, (-parent_node.gdfo, parent_node)) |
924 | - replace_item[0] = 0 |
925 | - else: |
926 | - heappush(queue, (-parent_node.gdfo, parent_node)) |
927 | - PyList_Append(self._to_cleanup, parent_node) |
928 | - elif parent_node.ancestor_of != node.ancestor_of: |
929 | - # Combine to get the full set of parents |
930 | - # Rewrite using PySet_* functions, unfortunately you have to use |
931 | - # PySet_Add since there is no PySet_Update... :( |
932 | - all_ancestors = set(parent_node.ancestor_of) |
933 | - for k in node.ancestor_of: |
934 | - PySet_Add(all_ancestors, k) |
935 | - parent_node.ancestor_of = tuple(sorted(all_ancestors)) |
936 | - return 0 |
937 | - |
938 | - cdef object _heads_from_candidate_nodes(self, candidate_nodes, dom_to_node): |
939 | - cdef _KnownGraphNode node |
940 | - cdef _KnownGraphNode parent_node |
941 | - cdef Py_ssize_t num_candidates |
942 | - cdef int num_parents, replace_item |
943 | - cdef Py_ssize_t pos |
944 | - cdef PyObject *temp_node |
945 | - |
946 | - queue = [] |
947 | - pos = 0 |
948 | - while PyDict_Next(candidate_nodes, &pos, NULL, &temp_node): |
949 | - node = <_KnownGraphNode>temp_node |
950 | - node.ancestor_of = (node.key,) |
951 | - PyList_Append(queue, (-node.gdfo, node)) |
952 | - PyList_Append(self._to_cleanup, node) |
953 | - heapify(queue) |
954 | - # These are nodes that we determined are 'common' that we are no longer |
955 | - # walking |
956 | - # Now we walk nodes until all nodes that are being walked are 'common' |
957 | - num_candidates = len(candidate_nodes) |
958 | - replace_item = 0 |
959 | - while PyList_GET_SIZE(queue) > 0 and PyDict_Size(candidate_nodes) > 1: |
960 | - if replace_item: |
961 | - # We still need to pop the smallest member out of the queue |
962 | - # before we peek again |
963 | - heappop(queue) |
964 | - if PyList_GET_SIZE(queue) == 0: |
965 | - break |
966 | - # peek at the smallest item. We don't pop, because we expect we'll |
967 | - # need to push more things into the queue anyway |
968 | - node = _peek_node(queue) |
969 | - replace_item = 1 |
970 | - if PyTuple_GET_SIZE(node.ancestor_of) == num_candidates: |
971 | - # This node is now considered 'common' |
972 | - # Make sure all parent nodes are marked as such |
973 | - for pos from 0 <= pos < PyTuple_GET_SIZE(node.parents): |
974 | - parent_node = _get_parent(node.parents, pos) |
975 | - if parent_node.ancestor_of is not None: |
976 | - parent_node.ancestor_of = node.ancestor_of |
977 | - if node.linear_dominator_node is not node: |
978 | - parent_node = node.linear_dominator_node |
979 | - if parent_node.ancestor_of is not None: |
980 | - parent_node.ancestor_of = node.ancestor_of |
981 | - continue |
982 | - if node.parents is None: |
983 | - # This is a ghost |
984 | - continue |
985 | - # Now project the current nodes ancestor list to the parent nodes, |
986 | - # and queue them up to be walked |
987 | - if node.linear_dominator_node is not node: |
988 | - # We are at the tip of a long linear region |
989 | - # We know that there is nothing between here and the tail |
990 | - # that is interesting, so skip to the end |
991 | - self._process_parent(node, node.linear_dominator_node, |
992 | - candidate_nodes, dom_to_node, queue, &replace_item) |
993 | - else: |
994 | - for pos from 0 <= pos < PyTuple_GET_SIZE(node.parents): |
995 | - parent_node = _get_parent(node.parents, pos) |
996 | - self._process_parent(node, parent_node, candidate_nodes, |
997 | - dom_to_node, queue, &replace_item) |
998 | - for pos from 0 <= pos < PyList_GET_SIZE(self._to_cleanup): |
999 | - node = _get_list_node(self._to_cleanup, pos) |
1000 | - node.ancestor_of = None |
1001 | - self._to_cleanup = [] |
1002 | - return PyFrozenSet_New(candidate_nodes) |
1003 | |
1004 | === modified file 'bzrlib/tests/test__known_graph.py' |
1005 | --- bzrlib/tests/test__known_graph.py 2009-06-16 15:35:14 +0000 |
1006 | +++ bzrlib/tests/test__known_graph.py 2009-06-19 02:35:18 +0000 |
1007 | @@ -71,10 +71,6 @@ |
1008 | def make_known_graph(self, ancestry): |
1009 | return self.module.KnownGraph(ancestry, do_cache=self.do_cache) |
1010 | |
1011 | - def assertDominator(self, graph, rev, dominator): |
1012 | - node = graph._nodes[rev] |
1013 | - self.assertEqual(dominator, node.linear_dominator) |
1014 | - |
1015 | def assertGDFO(self, graph, rev, gdfo): |
1016 | node = graph._nodes[rev] |
1017 | self.assertEqual(gdfo, node.gdfo) |
1018 | @@ -88,29 +84,6 @@ |
1019 | self.assertEqual(['rev4'], sorted(graph._nodes['rev3'].child_keys)) |
1020 | self.assertEqual(['rev4'], sorted(graph._nodes['rev2b'].child_keys)) |
1021 | |
1022 | - def test_dominators_ancestry_1(self): |
1023 | - graph = self.make_known_graph(test_graph.ancestry_1) |
1024 | - self.assertDominator(graph, 'rev1', NULL_REVISION) |
1025 | - self.assertDominator(graph, 'rev2b', 'rev2b') |
1026 | - self.assertDominator(graph, 'rev2a', 'rev2a') |
1027 | - self.assertDominator(graph, 'rev3', 'rev2a') |
1028 | - self.assertDominator(graph, 'rev4', 'rev4') |
1029 | - |
1030 | - def test_dominators_feature_branch(self): |
1031 | - graph = self.make_known_graph(test_graph.feature_branch) |
1032 | - self.assertDominator(graph, 'rev1', NULL_REVISION) |
1033 | - self.assertDominator(graph, 'rev2b', NULL_REVISION) |
1034 | - self.assertDominator(graph, 'rev3b', NULL_REVISION) |
1035 | - |
1036 | - def test_dominators_extended_history_shortcut(self): |
1037 | - graph = self.make_known_graph(test_graph.extended_history_shortcut) |
1038 | - self.assertDominator(graph, 'a', NULL_REVISION) |
1039 | - self.assertDominator(graph, 'b', 'b') |
1040 | - self.assertDominator(graph, 'c', 'b') |
1041 | - self.assertDominator(graph, 'd', 'b') |
1042 | - self.assertDominator(graph, 'e', 'e') |
1043 | - self.assertDominator(graph, 'f', 'f') |
1044 | - |
1045 | def test_gdfo_ancestry_1(self): |
1046 | graph = self.make_known_graph(test_graph.ancestry_1) |
1047 | self.assertGDFO(graph, 'rev1', 2) |
1048 | @@ -229,3 +202,14 @@ |
1049 | self.assertEqual(set(['z']), graph.heads(['w', 's', 'z'])) |
1050 | self.assertEqual(set(['w', 'q']), graph.heads(['w', 's', 'q'])) |
1051 | self.assertEqual(set(['z']), graph.heads(['s', 'z'])) |
1052 | + |
1053 | + def test_heads_with_ghost(self): |
1054 | + graph = self.make_known_graph(test_graph.with_ghost) |
1055 | + self.assertEqual(set(['e', 'g']), graph.heads(['e', 'g'])) |
1056 | + self.assertEqual(set(['a', 'c']), graph.heads(['a', 'c'])) |
1057 | + self.assertEqual(set(['a', 'g']), graph.heads(['a', 'g'])) |
1058 | + self.assertEqual(set(['f', 'g']), graph.heads(['f', 'g'])) |
1059 | + self.assertEqual(set(['c']), graph.heads(['c', 'g'])) |
1060 | + self.assertEqual(set(['c']), graph.heads(['c', 'b', 'd', 'g'])) |
1061 | + self.assertEqual(set(['a', 'c']), graph.heads(['a', 'c', 'e', 'g'])) |
1062 | + self.assertEqual(set(['a', 'c']), graph.heads(['a', 'c', 'f'])) |
1063 | |
1064 | === modified file 'tools/time_graph.py' |
1065 | --- tools/time_graph.py 2009-06-12 18:05:15 +0000 |
1066 | +++ tools/time_graph.py 2009-06-19 02:35:18 +0000 |
1067 | @@ -16,12 +16,15 @@ |
1068 | from bzrlib.ui import text |
1069 | |
1070 | p = optparse.OptionParser() |
1071 | +p.add_option('--quick', default=False, action='store_true') |
1072 | p.add_option('--max-combinations', default=500, type=int) |
1073 | p.add_option('--lsprof', default=None, type=str) |
1074 | opts, args = p.parse_args(sys.argv[1:]) |
1075 | + |
1076 | trace.enable_default_logging() |
1077 | ui.ui_factory = text.TextUIFactory() |
1078 | |
1079 | +begin = time.clock() |
1080 | if len(args) >= 1: |
1081 | b = branch.Branch.open(args[0]) |
1082 | else: |
1083 | @@ -33,8 +36,9 @@ |
1084 | if p[1] is not None) |
1085 | finally: |
1086 | b.unlock() |
1087 | +end = time.clock() |
1088 | |
1089 | -print 'Found %d nodes' % (len(parent_map),) |
1090 | +print 'Found %d nodes, loaded in %.3fs' % (len(parent_map), end - begin) |
1091 | |
1092 | def all_heads_comp(g, combinations): |
1093 | h = [] |
1094 | @@ -47,6 +51,7 @@ |
1095 | finally: |
1096 | pb.finished() |
1097 | return h |
1098 | + |
1099 | combinations = [] |
1100 | # parents = parent_map.keys() |
1101 | # for p1 in parents: |
1102 | @@ -65,33 +70,48 @@ |
1103 | combinations = random.sample(combinations, opts.max_combinations) |
1104 | |
1105 | print ' %d combinations' % (len(combinations),) |
1106 | -t1 = time.clock() |
1107 | -known_g = _known_graph_py.KnownGraph(parent_map) |
1108 | -if opts.lsprof is not None: |
1109 | - h_known = commands.apply_lsprofiled(opts.lsprof, |
1110 | - all_heads_comp, known_g, combinations) |
1111 | -else: |
1112 | - h_known = all_heads_comp(known_g, combinations) |
1113 | -t2 = time.clock() |
1114 | -print "Known: %.3fs" % (t2-t1,) |
1115 | -print " %s" % (graph._counters,) |
1116 | -t1 = time.clock() |
1117 | -known_g = _known_graph_pyx.KnownGraph(parent_map) |
1118 | -if opts.lsprof is not None: |
1119 | - h_known = commands.apply_lsprofiled(opts.lsprof, |
1120 | - all_heads_comp, known_g, combinations) |
1121 | -else: |
1122 | - h_known = all_heads_comp(known_g, combinations) |
1123 | -t2 = time.clock() |
1124 | -print "Known (pyx): %.3fs" % (t2-t1,) |
1125 | -print " %s" % (graph._counters,) |
1126 | -simple_g = graph.Graph(graph.DictParentsProvider(parent_map)) |
1127 | -graph._counters[1] = 0 |
1128 | -graph._counters[2] = 0 |
1129 | -h_simple = all_heads_comp(simple_g, combinations) |
1130 | -t3 = time.clock() |
1131 | -print "Orig: %.3fs" % (t3-t2,) |
1132 | -print " %s" % (graph._counters,) |
1133 | -if h_simple != h_known: |
1134 | - import pdb; pdb.set_trace() |
1135 | -print 'ratio: %.3fs' % ((t2-t1) / (t3-t2)) |
1136 | + |
1137 | +def combi_graph(graph_klass, comb): |
1138 | + # DEBUG |
1139 | + graph._counters[1] = 0 |
1140 | + graph._counters[2] = 0 |
1141 | + |
1142 | + begin = time.clock() |
1143 | + g = graph_klass(parent_map) |
1144 | + if opts.lsprof is not None: |
1145 | + heads = commands.apply_lsprofiled(opts.lsprof, all_heads_comp, g, comb) |
1146 | + else: |
1147 | + heads = all_heads_comp(g, comb) |
1148 | + end = time.clock() |
1149 | + return dict(elapsed=(end - begin), graph=g, heads=heads) |
1150 | + |
1151 | +def report(name, g): |
1152 | + print '%s: %.3fs' % (name, g['elapsed']) |
1153 | + counters_used = False |
1154 | + for c in graph._counters: |
1155 | + if c: |
1156 | + counters_used = True |
1157 | + if counters_used: |
1158 | + print ' %s' % (graph._counters,) |
1159 | + |
1160 | +known_python = combi_graph(_known_graph_py.KnownGraph, combinations) |
1161 | +report('Known', known_python) |
1162 | + |
1163 | +known_pyrex = combi_graph(_known_graph_pyx.KnownGraph, combinations) |
1164 | +report('Known (pyx)', known_pyrex) |
1165 | + |
1166 | +def _simple_graph(parent_map): |
1167 | + return graph.Graph(graph.DictParentsProvider(parent_map)) |
1168 | + |
1169 | +if opts.quick: |
1170 | + if known_python['heads'] != known_pyrex['heads']: |
1171 | + import pdb; pdb.set_trace() |
1172 | + print 'ratio: %.3fs' % (known_pyrex['elapsed'] / known_python['elapsed']) |
1173 | +else: |
1174 | + orig = combi_graph(_simple_graph, combinations) |
1175 | + report('Orig', orig) |
1176 | + |
1177 | + if orig['heads'] != known_pyrex['heads']: |
1178 | + import pdb; pdb.set_trace() |
1179 | + |
1180 | + print 'ratio: %.3fs' % (known_pyrex['elapsed'] / orig['elapsed']) |
This is a new implementation of KnownGraph.heads() and a companion _find_gdfo().
The main target is annotate; this doesn't address the revno
calculation, which is certainly the next target. (All numbers below
referring to annotate are with --show-ids, i.e. the improvements
are due to better processing of the per-file graph, not the revision
graph.)
Climbing on John's shoulders (all the tests and the pyrex plumbing
were there, waiting ;-P) I was able to improve the performance for
"dense" graphs (think mysql). The performance for bzr improves
too, but in a less spectacular way.
This patch also includes a fix by John for his implementation
that also gives some spectacular results; keep that in mind when
reading the numbers.
Below are some results for 1.16, bzr.dev@4459, John's version
(an unreleased one, see
revid:<email address hidden> in
this branch) and this proposed version (which supersedes John's).
I've used NEWS for bzr as a reference point in the bzr.dev@4459
tree, and sql/mysqld.cc in the dev6 conversion of mysql-6.0@2791.
I've also used tools/time_graph.py to measure the performance
more precisely.
Annotate bzr NEWS:
1.16: real 12.36 user 12.27 sys 0.04
trunk: real 6.84 user 6.80 sys 0.03
john: real 6.76 user 6.11 sys 0.12
this: real 6.00 user 5.96 sys 0.03
Annotate mysql sql/mysqld.cc:
1.16: real 21.66 user 21.52 sys 0.14
trunk: real
john: real 19.97 user 19.62 sys 0.17
this: real 18.69 user 18.56 sys 0.14
time_graph bzr, 25378 nodes, 7586 combinations:
1.16: N.A.
trunk: python 127.070s pyrex 49.800s
john: python 1.030s pyrex 0.700s
this: python 0.960s pyrex 0.540s
time_graph mysql, 67633 nodes, 33740 combinations:
1.16: N.A.
trunk: python 3470.620s pyrex 1751.760s
john: python 34.190s pyrex 36.480s
this: python 33.750s pyrex 17.670s
So while John's fixed version vastly improves that specific part
(gdfo rules), the proposed new algorithm goes even further for
'bushy' graphs.
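Since the gdfo numbering is what makes the cutoff work, it may help to sketch how _find_gdfo computes it: walk from the tails toward the heads, and only continue through a child once all of its parents have been numbered. This is a simplified pure-Python sketch (names are illustrative, not the actual pyrex code, and ghosts are ignored):

```python
def find_gdfo(parent_map):
    """Compute gdfo: tails get 1, every child gets 1 + max over its parents."""
    # Invert the graph so we can walk from tails toward heads.
    children = dict((k, []) for k in parent_map)
    for key, parents in parent_map.items():
        for p in parents:
            children[p].append(key)
    gdfo = {}
    counted = {}  # how many of a child's parents have been numbered so far
    pending = [k for k, ps in parent_map.items() if not ps]  # the tails
    for tail in pending:
        gdfo[tail] = 1
    while pending:
        key = pending.pop()
        for child in children[key]:
            counted[child] = counted.get(child, 0) + 1
            if gdfo.get(child, 0) < gdfo[key] + 1:
                gdfo[child] = gdfo[key] + 1
            if counted[child] == len(parent_map[child]):
                # Last parent numbered: safe to continue from this child.
                pending.append(child)
    return gdfo
```

Each edge is visited exactly once, which is why this replaces the heap-based numbering in trunk without needing heappush/heapreplace at all.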
Unlike John, I couldn't find a way to make good use of 'linear
dominators' (either my pyrex fu is weak (two days old) or they'll
be useful elsewhere (I'm sure about that ;-)). Anyway, trying to
use them only decreased performance and increased memory
consumption, so I finally got rid of them.
There are certainly better optimisations to put in place at the
pyrex level, but I thought the improvement was worth a review.