Bazaar

Merge lp:~jameinel/bzr/2.1-simple-set into lp:bzr

2.1-simple-set
Merge into bzr.dev

Proposed by John A Meinel on 2009-10-08

Status:	Merged
Merged at revision:	not available
Proposed branch:	lp:~jameinel/bzr/2.1-simple-set
Merge into:	lp:bzr
Diff against target:	1146 lines 8 files modified .bzrignore (+3/-0) NEWS (+7/-0) bzrlib/_simple_set_pyx.pxd (+91/-0) bzrlib/_simple_set_pyx.pyx (+600/-0) bzrlib/python-compat.h (+5/-0) bzrlib/tests/__init__.py (+1/-0) bzrlib/tests/test__simple_set.py (+371/-0) setup.py (+1/-0)
To merge this branch:	bzr merge lp:~jameinel/bzr/2.1-simple-set
Related bugs:	Link a bug report

Reviewer	Review Type	Date Requested	Status
Andrew Bennetts		2009-10-08	Approve on 2009-10-09
Review via email: mp+13039@code.launchpad.net

Revision history for this message

John A Meinel (jameinel) wrote on 2009-10-08:

Download full text (3.6 KiB)

This is the first in my series of patches to lower memory overhead with StaticTuple objects.

I picked this first because it wasn't strictly dependent on the other code, but I use it in the implementation of StaticTuple's interning.

Anyway, this introduces a "SimpleSet" class. For now, I have a pyrex version but no pure-python version, as it is only currently used in StaticTuple. I may go to the point of implementing a python version, though it is fairly close in api to a set or a dict. The big thing is that we don't really have values like a dict, but a set() doesn't let you get access to the object which is stored in the internal table.

So the 2 primary wins for this class is:

1) Don't cache the hash of every object. For the types of object we are putting in here, the hash should be relatively cheap to compute. Even further, though, the times when you need the hash are:
a) When inserting a new object, you need to know its hash, but you haven't cached it yet anyway.
b) When resolving a collision, you compare the hash to the cached value as a 'cheap' comparison.
However, the number of collisions is based on the quality of your hash, your collision avoidance algorithm, and the size of your table. We can't really change the hash function. We could use 'quadratic hash', but what sets use seems pretty good anyway (it mixes in more of the upper bits, so you probably get divergence faster, but you also lose locality...)
As for the size of the table. It takes the same number of bytes to cache a 'long hash', as it takes to hold a 'PyObject *'. Which means that in the same memory, you could cache hashes *or* double your addressable space and halve the number of collisions.

2) Allow lookups, so we don't need to use a Dict, which has yet another pointer per address. (So for the same number of entries, this will generally be 1/3rd the size of the equivalent dict, and 1/2 the size of the equivalent set.)

3) Have a single function for the equivalent function 'key = dict.setdefault(key, key)'. At the C api, you could use dict->lookup which is a rather private function, or you would generally use PyDict_GetItem() followed by PyDict_SetItem().
With SimpleSet_Add(), it returns the object stored there, so you have a single lookup. (Only really important in the case of collisions, where you may have several steps that need to be repeated.)

Because of the memory savings, I may look into using this elsewhere in our code base. If I do, then I will certainly implement a Python version (probably just subclassing a dict and exposing an 'def add(self key): return self.setdefault(key, key)' function. (And whatever else I specifically need.

As for the specific memory savings here, dicts and sets are resized to average 50% full (resize at 67%, etc.), so is SimpleSet [theory says hash tables fall apart at about 80% full]. The idea is to start interning the StaticTuple objects I'm creating. (For every key, you average at least 2 references, one for the key, and one for all of its children.)
When loading all of launchpad, that translates into about 500k strings, and 1.1M interned StaticTuples (there are actually more, because of how the btree code uses tuples-of...

This is the first in my series of patches to lower memory overhead with StaticTuple objects.

I picked this first because it wasn't strictly dependent on the other code, but I use it in the implementation of StaticTuple's interning.

So the 2 primary wins for this class is:

1) Don't cache the hash of every object. For the types of object we are putting in here, the hash should be relatively cheap to compute. Even further, though, the times when you need the hash are:
  a) When inserting a new object, you need to know its hash, but you haven't cached it yet anyway.
  b) When resolving a collision, you compare the hash to the cached value as a 'cheap' comparison.
However, the number of collisions is based on the quality of your hash, your collision avoidance algorithm, and the size of your table. We can't really change the hash function. We could use 'quadratic hash', but what sets use seems pretty good anyway (it mixes in more of the upper bits, so you probably get divergence faster, but you also lose locality...)
As for the size of the table. It takes the same number of bytes to cache a 'long hash', as it takes to hold a 'PyObject *'. Which means that in the same memory, you could cache hashes *or* double your addressable space and halve the number of collisions.

Anyway, that means that 1.1M tuples requires an 8MB SimpleSet, but would require a 24MB dict. (So my in my memory savings of going from 325MB => 240MB [85MB], 16/85=19% of that is by using a SimpleSet here.)

Especially given that the size of all StaticTuples plus there referenced strings is only 44.8MB. (This is also why intern(string) is not particularly memory efficient, as it costs 24bytes per item in the dict, versus 24+N bytes per string.)

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-10-09:

Download full text (10.0 KiB)

I'm very excited by this patch series!

This one is:

review approve

Although there are some comments you should look at, especially regarding
licensing...

John A Meinel wrote:
[...]
> === added file 'bzrlib/_simple_set_pyx.pxd'
[...]
> +"""Interface definition of a class like PySet but without caching the hash.
> +
> +This is generally useful when you want to 'intern' objects, etc. Note that this
> +differs from Set in that we:
> + 1) Don't have all of the .intersection, .difference, etc functions
> + 2) Do return the object from the set via queries
> + eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
> +"""

I feel a bit unsure about a type that has both add and __getitem__/__delitem__.
It's a slightly unusual mix of dict-like and set-like APIs. I think it's fine
after some thinking about it, I'm just noting that it seems a bit funny at
first. Although maybe .remove would be more set-like than .__delitem__?

> +cdef api object SimpleSet_Add(object self, object key)
> +cdef api SimpleSet SimpleSet_New()
> +cdef api object SimpleSet_Add(object self, object key)

That appears to be a duplicate declaration of SimpleSet_Add.

> +cdef api int SimpleSet_Contains(object self, object key) except -1
> +cdef api int SimpleSet_Discard(object self, object key) except -1
> +cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL
> +cdef api Py_ssize_t SimpleSet_Size(object self) except -1
> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key)
>
> === added file 'bzrlib/_simple_set_pyx.pyx'
[...]
> +cdef object _dummy_obj
> +cdef PyObject *_dummy
> +_dummy_obj = object()
> +_dummy = <PyObject *>_dummy_obj

It's not very clear what _dummy is used for. I guess what's missing is some
text giving an overview of the data structure, although I suppose that's what
you mean to convey by the docstrings saying it is similar to the builtin
dict/set types.

Anyway, it appears _dummy is used to avoid resizing/compacting the
table for every single discard. [Part of why it would be nice to have some text
describing the data structure is so that there's some clear terminology to
use... the code pretty clearly talks about “tables” and “slots”, which seem
fairly clear, but then what's an “entry”?]

> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> + cdef long other_hash
> + cdef PyObject *res
> +
> + if this == other:
> + return 1
> + other_hash = Py_TYPE(other).tp_hash(other)

What happens if 'other' is not hashable?

> + if other_hash != this_hash:
> + return 0
> + res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)

Similarly, what if one the richcompare calls raise an exception?

I'm very excited by this patch series!

This one is:

review approve

Although there are some comments you should look at, especially regarding
licensing...

John A Meinel wrote:
[...]
> === added file 'bzrlib/_simple_set_pyx.pxd'
[...]
> +"""Interface definition of a class like PySet but without caching the hash.
> +
> +This is generally useful when you want to 'intern' objects, etc. Note that this
> +differs from Set in that we:
> +  1) Don't have all of the .intersection, .difference, etc functions
> +  2) Do return the object from the set via queries
> +     eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
> +"""

I feel a bit unsure about a type that has both add and __getitem__/__delitem__.
It's a slightly unusual mix of dict-like and set-like APIs.  I think it's fine
after some thinking about it, I'm just noting that it seems a bit funny at
first.  Although maybe .remove would be more set-like than .__delitem__?

> +cdef api object SimpleSet_Add(object self, object key)
> +cdef api SimpleSet SimpleSet_New()
> +cdef api object SimpleSet_Add(object self, object key)

That appears to be a duplicate declaration of SimpleSet_Add.

> +cdef api int SimpleSet_Contains(object self, object key) except -1
> +cdef api int SimpleSet_Discard(object self, object key) except -1
> +cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL
> +cdef api Py_ssize_t SimpleSet_Size(object self) except -1
> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key)
> 
> === added file 'bzrlib/_simple_set_pyx.pyx'
[...]
> +cdef object _dummy_obj
> +cdef PyObject *_dummy
> +_dummy_obj = object()
> +_dummy = <PyObject *>_dummy_obj

It's not very clear what _dummy is used for.  I guess what's missing is some
text giving an overview of the data structure, although I suppose that's what
you mean to convey by the docstrings saying it is similar to the builtin
dict/set types.

Anyway, it appears _dummy is used to avoid resizing/compacting the
table for every single discard.  [Part of why it would be nice to have some text
describing the data structure is so that there's some clear terminology to
use... the code pretty clearly talks about “tables” and “slots”, which seem
fairly clear, but then what's an “entry”?]

> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> +    cdef long other_hash
> +    cdef PyObject *res
> +
> +    if this == other:
> +        return 1
> +    other_hash = Py_TYPE(other).tp_hash(other)

What happens if 'other' is not hashable?

> +    if other_hash != this_hash:
> +        return 0
> +    res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)

Similarly, what if one the richcompare calls raise an exception?

> +cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
[...]
> +    cdef int _insert_clean(self, PyObject *key) except -1:
> +        """Insert a key into self.table.
> +
> +        This is only meant to be used during times like '_resize',
> +        as it makes a lot of assuptions about keys not already being present,
> +        and there being no dummy entries.
> +        """
> +        cdef size_t i, perturb, mask
> +        cdef long the_hash
> +        cdef PyObject **table, **entry
> +
> +        mask = self._mask
> +        table = self._table
> +
> +        the_hash = Py_TYPE(key).tp_hash(key)
> +        i = the_hash & mask
> +        entry = &table[i]
> +        perturb = the_hash
> +        # Because we know that we made all items unique before, we can just
> +        # iterate as long as the target location is not empty, we don't have to
> +        # do any comparison, etc.
> +        while entry[0] != NULL:
> +            i = (i << 2) + i + perturb + 1
> +            entry = &table[i & mask]
> +            perturb >>= PERTURB_SHIFT

This is code borrowed from Python's dictobject.c/setobject.c, I think?  I think
you need to:

- point to the Python source, which explains this logic
 - and maybe adjust the copyright & licence declaration.  I'm no expert, but
   maybe we shouldn't be claiming sole ownership of this code... maybe put the
   key parts like this that are fairly directly taken from Python (assuming I'm
   right about that!) into a separate file, so we keep it clear which code is
   solely ours, and which is mixed.  Possibly that's just this function and the
   _lookup function?

> +    cdef Py_ssize_t _resize(self, Py_ssize_t min_used) except -1:
[...]
> +        # We rolled over our signed size field
> +        if new_size <= 0:
> +            raise MemoryError()

It's good to see code being so pedantic :)

> +    def discard(self, key):
> +        """Remove key from the dict, whether it exists or not.
> +
> +        :return: 0 if the item did not exist, 1 if it did
> +        """
> +        return self._discard(key)

It'd be slightly nicer to return Py_False/Py_True.  I guess you can easily do
that by doing return bool(self._discard(key))?  Or maybe with PyBool_FromLong...

> +cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
> +    """Find the slot where 'key' would fit.
> +
> +    This is the same as a dicts 'lookup' function.
> +
> +    :param key: An object we are looking up
> +    :param hash: The hash for key
> +    :return: The location in self.table where key should be put
> +        should never be NULL, but may reference a NULL (PyObject*)
> +    """

I don't quite follow the “should never be NULL, but may reference a NULL” part
of that docstring.

[...]
> +    if cur[0] == NULL:
> +        # Found a blank spot, or found the exact key
> +        return cur
> +    if cur[0] == py_key:
> +        return cur
> +    if cur[0] == _dummy:
> +        free_slot = cur
> +    else:
> +        if _is_equal(py_key, key_hash, cur[0]):
> +            # Both py_key and cur[0] belong in this slot, return it
> +            return cur
> +        free_slot = NULL
> +    # size_t is unsigned, hash is signed...
> +    perturb = key_hash
> +    while True:
> +        i = (i << 2) + i + perturb + 1
> +        cur = &table[i & mask]
> +        if cur[0] == NULL: # Found an empty spot
> +            if free_slot: # Did we find a _dummy earlier?
> +                return free_slot
> +            else:
> +                return cur
> +        if (cur[0] == py_key # exact match
> +            or _is_equal(py_key, key_hash, cur[0])): # Equivalent match
> +            return cur
> +        if (cur[0] == _dummy and free_slot == NULL):
> +            free_slot = cur
> +        perturb >>= PERTURB_SHIFT
> +    raise AssertionError('should never get here')

Hmm, this appears to be essentially:

<loop body>
while True:
    <increment stuff>
    <loop body>

Why can't it be:

while True:
    <loop body>
    <increment stuff>

> +# TODO: this should probably have direct tests, since it isn't used by __iter__
> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key):

I agree with that TODO ;)

Although getting rid of the duplication between this and
_SimpleSet_iterator.__next__ would be even nicer.

[...]
> +cdef int SimpleSet_traverse(SimpleSet self, visitproc visit, void *arg):
> +    """This is an implementation of 'tp_traverse' that hits the whole table.
> +
> +    Cython/Pyrex don't seem to let you define a tp_traverse, and they only
> +    define one for you if you have an 'object' attribute. Since they don't
> +    support C arrays of objects, we access the PyObject * directly.
> +    """
> +    cdef Py_ssize_t pos
> +    cdef PyObject *next_key
> +    cdef int ret
> +
> +    pos = 0
> +    while SimpleSet_Next(self, &pos, &next_key):
> +        ret = visit(next_key, arg)
> +        if ret:
> +            return ret
> +
> +    return 0;

You don't need the semi-colon ;)

I wonder if there's any risk that an object referenced by a SimpleSet might
somehow mutate that SimpleSet during this traverse?

> === added file 'bzrlib/tests/test__simple_set.py'
[...]
> +    def assertIn(self, obj, container):
> +        self.assertTrue(obj in container,
> +            '%s not found in %s' % (obj, container))
> +
> +    def assertNotIn(self, obj, container):
> +        self.assertTrue(obj not in container,
> +            'We found %s in %s' % (obj, container))

I was a bit surprised that we didn't seem to already have these!

> +    def assertRefcount(self, count, obj):
> +        """Assert that the refcount for obj is what we expect.
> +
> +        Note that this automatically adjusts for the fact that calling
> +        assertRefcount actually creates a new pointer, as does calling
> +        sys.getrefcount. So pass the expected value *before* the call.
> +        """
> +        # I'm not sure why the offset is 3, but I've check that in the caller,
> +        # an offset of 1 works, which is expected. Not sure why assertRefcount
> +        # is incrementing/decrementing 2 times
> +        self.assertEqual(count, sys.getrefcount(obj)-3)

Eep.  This is going to be pretty fragile across CPython versions, let alone
non-CPython Pythons...

Can you perhaps use weakrefs instead?  They're enough to let you observe object
lifetimes (either via a callback, or just seeing that the weakref cannot be
resolved), although for portability you still need to call gc.collect() before
you can reliably know if the object is gone or not...

I guess we can worry about disabling the tests that need this on other Pythons
if/when we run on them...  (and of course we may not care about this code much
on non-CPython!).

> +    def test__lookup(self):
> +        # The tuple hash function is rather good at entropy. For all integers
> +        # 0=>1023, hash((i,)) & 1023 maps to a unique output, and hash((i,j))
> +        # maps to all 1024 fields evenly.
> +        # However, hash((c,d))& 1023 for characters has an uneven distribution
> +        # of collisions, for example:
> +        #  ('a', 'a'), ('f', '4'), ('p', 'r'), ('q', '1'), ('F', 'T'),
> +        #  ('Q', 'Q'), ('V', 'd'), ('7', 'C')
> +        # all collide @ 643
> +        obj = self.module.SimpleSet()
> +        offset, val = obj._test_lookup(('a', 'a'))
> +        self.assertEqual(643, offset)
[...]

Hmm, again pretty fragile due to depending on Python internals.  Oh well.

-Andrew.

review: Approve

Revision history for this message

John A Meinel (jameinel) wrote on 2009-10-09:

Download full text (15.8 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
...

>> + 1) Don't have all of the .intersection, .difference, etc functions
>> + 2) Do return the object from the set via queries
>> + eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
>> +"""
>
> I feel a bit unsure about a type that has both add and __getitem__/__delitem__.
> It's a slightly unusual mix of dict-like and set-like APIs. I think it's fine
> after some thinking about it, I'm just noting that it seems a bit funny at
> first. Although maybe .remove would be more set-like than .__delitem__?

So at the moment, the only apis I need for interning are:
add() and discard()

I could certainly remove the __getitem__ and __delitem__ if you are more
comfortable with it.

I want to have an api that fits what we need/want to use. As such, it
will probably be driven by actual use cases. And so far, there is only 1
of those...

I considered changing the semantics, such that "__getitem__" == "add()".
The main reason for that is because it avoids the getattr() overhead, as
__getitem__ is a slot in the type struct (tp_item) while .add() requires
an attribute lookup.

Which basically means that in an interning loop you would have:

for bar in group:
bar = dedup[bar]

rather than

bar = dedup.add(bar)

or the old form

bar = dict.setdefault(bar, bar)

I would guess that doing

add = dedup.add
for bar in group:
bar = add(bar)

Is going to perform the same as the dedup[bar] form. But both should do
better than setdefault. (At a minimum, setdefault takes 2 parameters,
and thus has to create a tuple, etc.)

I don't know if the interpreter has further internal benefits to using
the __getitem__ form, since it is a known api that has to conform in
certain ways.

The big concern is that "dedup[bar]" can mutate 'dedup' and that is
potentially unexpected.

>
>> +cdef api object SimpleSet_Add(object self, object key)
>> +cdef api SimpleSet SimpleSet_New()
>> +cdef api object SimpleSet_Add(object self, object key)
>
> That appears to be a duplicate declaration of SimpleSet_Add.

Thanks.

...

> [...]
>> +cdef object _dummy_obj
>> +cdef PyObject *_dummy
>> +_dummy_obj = object()
>> +_dummy = <PyObject *>_dummy_obj
>
> It's not very clear what _dummy is used for. I guess what's missing is some
> text giving an overview of the data structure, although I suppose that's what
> you mean to convey by the docstrings saying it is similar to the builtin
> dict/set types.
>
> Anyway, it appears _dummy is used to avoid resizing/compacting the
> table for every single discard. [Part of why it would be nice to have some text
> describing the data structure is so that there's some clear terminology to
> use... the code pretty clearly talks about “tables” and “slots”, which seem
> fairly clear, but then what's an “entry”?]

How's this:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
...

>> +  1) Don't have all of the .intersection, .difference, etc functions
>> +  2) Do return the object from the set via queries
>> +     eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
>> +"""
> 
> I feel a bit unsure about a type that has both add and __getitem__/__delitem__.
> It's a slightly unusual mix of dict-like and set-like APIs.  I think it's fine
> after some thinking about it, I'm just noting that it seems a bit funny at
> first.  Although maybe .remove would be more set-like than .__delitem__?

So at the moment, the only apis I need for interning are:
  add() and discard()

I could certainly remove the __getitem__ and __delitem__ if you are more
comfortable with it.

I want to have an api that fits what we need/want to use. As such, it
will probably be driven by actual use cases. And so far, there is only 1
of those...

Which basically means that in an interning loop you would have:

for bar in group:
  bar = dedup[bar]

rather than

bar = dedup.add(bar)

or the old form

bar = dict.setdefault(bar, bar)

I would guess that doing

add = dedup.add
for bar in group:
  bar = add(bar)

Is going to perform the same as the dedup[bar] form. But both should do
better than setdefault. (At a minimum, setdefault takes 2 parameters,
and thus has to create a tuple, etc.)

I don't know if the interpreter has further internal benefits to using
the __getitem__ form, since it is a known api that has to conform in
certain ways.

The big concern is that "dedup[bar]" can mutate 'dedup' and that is
potentially unexpected.

> 
>> +cdef api object SimpleSet_Add(object self, object key)
>> +cdef api SimpleSet SimpleSet_New()
>> +cdef api object SimpleSet_Add(object self, object key)
> 
> That appears to be a duplicate declaration of SimpleSet_Add.

Thanks.

...

> [...]
>> +cdef object _dummy_obj
>> +cdef PyObject *_dummy
>> +_dummy_obj = object()
>> +_dummy = <PyObject *>_dummy_obj
> 
> It's not very clear what _dummy is used for.  I guess what's missing is some
> text giving an overview of the data structure, although I suppose that's what
> you mean to convey by the docstrings saying it is similar to the builtin
> dict/set types.
> 
> Anyway, it appears _dummy is used to avoid resizing/compacting the
> table for every single discard.  [Part of why it would be nice to have some text
> describing the data structure is so that there's some clear terminology to
> use... the code pretty clearly talks about “tables” and “slots”, which seem
> fairly clear, but then what's an “entry”?]

How's this:

# Data structure definition:
 #   This is a basic hash table using open addressing.
 #       http://en.wikipedia.org/wiki/Open_addressing
 #   Basically that means we keep an array of pointers to Python objects
 #   (called a table). Each location in the array is called a 'slot'.
 #
 #   An empty slot holds a NULL pointer, a slot where there was an item
 #   which was then deleted will hold a pointer to _dummy, and a filled slot
 #   points at the actual object which fills that slot.
 #
 #   The table is always a power of two, and the default location where an
 #   object is inserted is at hash(object) & (table_size - 1)
 #
 #   If there is a collision, then we search for another location. The
 #   specific algorithm is in _lookup. We search until we:
 #       find the object
 #       find an equivalent object (by tp_richcompare(obj1, obj2, Py_EQ))
 #       find a NULL slot
 #
 #   When an object is deleted, we set its slot to _dummy. this way we don't
 #   have to track whether there was a collision, and find the corresponding
 #   keys. (The collision resolution algorithm makes that nearly impossible
 #   anyway, because it depends on the upper bits of the hash.)
 #   The main effect of this, is that if we find _dummy, then we can insert
 #   an object there, but we have to keep searching until we find NULL to
 #   know that the object is not present elsewhere.

> 
>> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
>> +    cdef long other_hash
>> +    cdef PyObject *res
>> +
>> +    if this == other:
>> +        return 1
>> +    other_hash = Py_TYPE(other).tp_hash(other)
> 
> What happens if 'other' is not hashable?

Other will always be something that is already held in the internal
structure, and thus has been hashed. 'this' has also already been hashed
as part of inserting, and thus we've also checked that it can be hashed.

I can change this to PyObject_Hash() if you feel strongly. I was
avoiding a function call overhead, though it probably doesn't have a
huge impact on performance.

> 
>> +    if other_hash != this_hash:
>> +        return 0
>> +    res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
> 
> Similarly, what if one the richcompare calls raise an exception?

I'll admit that I took a few shortcuts, as I was originally thinking
that _table would be:

StaticTuple **_table

rather than a generic object. So I can...

1) Define _is_equal to 'except -1' and have Pyrex do exception checking
   on every query. And then I can check 'if res == NULL: return -1'.

2) Arguably doing "res == Py_True" isn't strictly correct either, and I
   should be using PyObject_IsTrue() here. However, the more generic we
   make this code, the slower it performs, and we have to ask whether it
   is worth being generic given that it is meant to be used with
   specific types...

3) I would consider going so far as to restrict this class to only
   supporting one type of object. (First object inserted sets the type,
   or it is set via the constructor.)

I'd like to avoid making the common case slow because of very unlikely
potential edge cases. But I'm willing to discuss where an appropriate
tradeoff is.

> 
>> +cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
> [...]
>> +    cdef int _insert_clean(self, PyObject *key) except -1:
>> +        """Insert a key into self.table.
>> +
>> +        This is only meant to be used during times like '_resize',
>> +        as it makes a lot of assuptions about keys not already being present,
>> +        and there being no dummy entries.
>> +        """
>> +        cdef size_t i, perturb, mask
>> +        cdef long the_hash
>> +        cdef PyObject **table, **entry
>> +
>> +        mask = self._mask
>> +        table = self._table
>> +
>> +        the_hash = Py_TYPE(key).tp_hash(key)
>> +        i = the_hash & mask
>> +        entry = &table[i]
>> +        perturb = the_hash
>> +        # Because we know that we made all items unique before, we can just
>> +        # iterate as long as the target location is not empty, we don't have to
>> +        # do any comparison, etc.
>> +        while entry[0] != NULL:
>> +            i = (i << 2) + i + perturb + 1
>> +            entry = &table[i & mask]
>> +            perturb >>= PERTURB_SHIFT
> 
> This is code borrowed from Python's dictobject.c/setobject.c, I think?  I think
> you need to:
> 
>  - point to the Python source, which explains this logic
>  - and maybe adjust the copyright & licence declaration.  I'm no expert, but
>    maybe we shouldn't be claiming sole ownership of this code... maybe put the
>    key parts like this that are fairly directly taken from Python (assuming I'm
>    right about that!) into a separate file, so we keep it clear which code is
>    solely ours, and which is mixed.  Possibly that's just this function and the
>    _lookup function?

The only bit that is close to copyrightable is:
>> +            i = (i << 2) + i + perturb + 1
>> +            entry = &table[i & mask]
>> +            perturb >>= PERTURB_SHIFT

Since it is an explicit combination of parameters. I'm happy to just
change it to something like:
 http://en.wikipedia.org/wiki/Quadratic_probing

Which is: trivial to implement, just a different set of performance
tradeoffs (you gain locality, at an increased chance for collisions),
and certainly a public domain solution that would avoid any copyright
issues.

> 
>> +    cdef Py_ssize_t _resize(self, Py_ssize_t min_used) except -1:
> [...]
>> +        # We rolled over our signed size field
>> +        if new_size <= 0:
>> +            raise MemoryError()
> 
> It's good to see code being so pedantic :)

new_size is a signed integer, many things allocate using an unsigned
integer, so it *can* happen. But yeah, I didn't think of it myself.

> 
>> +    def discard(self, key):
>> +        """Remove key from the dict, whether it exists or not.
>> +
>> +        :return: 0 if the item did not exist, 1 if it did
>> +        """
>> +        return self._discard(key)
> 
> It'd be slightly nicer to return Py_False/Py_True.  I guess you can easily do
> that by doing return bool(self._discard(key))?  Or maybe with PyBool_FromLong...
>

Originally I was using:

cpdef bint discard(self, key):
  ...

Which is Cython code that defines both a Python version and a C version,
and casts integers into Booleans. I can certainly just do:

if self._discard(key);
  return True
return False

>> +cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
>> +    """Find the slot where 'key' would fit.
>> +
>> +    This is the same as a dicts 'lookup' function.
>> +
>> +    :param key: An object we are looking up
>> +    :param hash: The hash for key
>> +    :return: The location in self.table where key should be put
>> +        should never be NULL, but may reference a NULL (PyObject*)
>> +    """
> 
> I don't quite follow the “should never be NULL, but may reference a NULL” part
> of that docstring.
>

I shouldn't be using the term 'dict' here either.

Anyway, I'm returning a **. I'll try to clarify. But in essence:

PyObject **result = _lookup(...);

assert(result != NULL)
if (*result) == NULL: # nothing is stored here

> [...]
>> +    if cur[0] == NULL:
>> +        # Found a blank spot, or found the exact key
>> +        return cur
>> +    if cur[0] == py_key:
>> +        return cur
>> +    if cur[0] == _dummy:
>> +        free_slot = cur
>> +    else:
>> +        if _is_equal(py_key, key_hash, cur[0]):
>> +            # Both py_key and cur[0] belong in this slot, return it
>> +            return cur
>> +        free_slot = NULL
>> +    # size_t is unsigned, hash is signed...
>> +    perturb = key_hash
>> +    while True:
>> +        i = (i << 2) + i + perturb + 1
>> +        cur = &table[i & mask]
>> +        if cur[0] == NULL: # Found an empty spot
>> +            if free_slot: # Did we find a _dummy earlier?
>> +                return free_slot
>> +            else:
>> +                return cur
>> +        if (cur[0] == py_key # exact match
>> +            or _is_equal(py_key, key_hash, cur[0])): # Equivalent match
>> +            return cur
>> +        if (cur[0] == _dummy and free_slot == NULL):
>> +            free_slot = cur
>> +        perturb >>= PERTURB_SHIFT
>> +    raise AssertionError('should never get here')
> 
> Hmm, this appears to be essentially:
> 
> <loop body>
> while True:
>     <increment stuff>
>     <loop body>
> 
> Why can't it be:
> 
> while True:
>     <loop body>
>     <increment stuff>

So, there are a few subtle differences, but I'm not sure if it is worth
the code duplication. Namely

1) The first lookup is going to be the right location 90+% of the time.
(if things are sized correctly, and your hash is working well, etc.) So
you want to avoid any overhead that you can.

2) One of that is checking whether free_slot is NULL or not.

3) The loop code changes the order of checking for _dummy with checking
   _is_equal, because in the claim is that _dummy rare enough to offset
   the speed of the check with the speed of the comparison. I guess I'm
   doubtful, too.

> 
> ?
> 
>> +# TODO: this should probably have direct tests, since it isn't used by __iter__
>> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key):
> 
> I agree with that TODO ;)
> 
> Although getting rid of the duplication between this and
> _SimpleSet_iterator.__next__ would be even nicer.

It costs a function call per item, but __next__ is already a python
function call, so I'll just do it. We don't really list(set) often
anyway. (Current use cases *never* do so...)

...

> You don't need the semi-colon ;)
> 
> I wonder if there's any risk that an object referenced by a SimpleSet might
> somehow mutate that SimpleSet during this traverse?

This is the same basic implementation that is used by dicts and sets.
Namely, tp_traverse is used by the garbage collector, to 'doing stuff'
is supposed to be forbidden (I assume).

> 
>> +    def assertRefcount(self, count, obj):
>> +        """Assert that the refcount for obj is what we expect.
>> +
>> +        Note that this automatically adjusts for the fact that calling
>> +        assertRefcount actually creates a new pointer, as does calling
>> +        sys.getrefcount. So pass the expected value *before* the call.
>> +        """
>> +        # I'm not sure why the offset is 3, but I've check that in the caller,
>> +        # an offset of 1 works, which is expected. Not sure why assertRefcount
>> +        # is incrementing/decrementing 2 times
>> +        self.assertEqual(count, sys.getrefcount(obj)-3)
> 
> Eep.  This is going to be pretty fragile across CPython versions, let alone
> non-CPython Pythons...

Do you really think that refcounts are going to be fragile across
CPython versions?

> 
> Can you perhaps use weakrefs instead?  They're enough to let you observe object
> lifetimes (either via a callback, or just seeing that the weakref cannot be
> resolved), although for portability you still need to call gc.collect() before
> you can reliably know if the object is gone or not...

tuples can't have weakrefs. I really *do* want to assert that my
refcounts are correct. And Pyrex code is *not* portable to anything but
CPython.

> 
> I guess we can worry about disabling the tests that need this on other Pythons
> if/when we run on them...  (and of course we may not care about this code much
> on non-CPython!).
>

Exactly. You can't write a C extension to Jython or IronPython...

>> +    def test__lookup(self):
>> +        # The tuple hash function is rather good at entropy. For all integers
>> +        # 0=>1023, hash((i,)) & 1023 maps to a unique output, and hash((i,j))
>> +        # maps to all 1024 fields evenly.
>> +        # However, hash((c,d))& 1023 for characters has an uneven distribution
>> +        # of collisions, for example:
>> +        #  ('a', 'a'), ('f', '4'), ('p', 'r'), ('q', '1'), ('F', 'T'),
>> +        #  ('Q', 'Q'), ('V', 'd'), ('7', 'C')
>> +        # all collide @ 643
>> +        obj = self.module.SimpleSet()
>> +        offset, val = obj._test_lookup(('a', 'a'))
>> +        self.assertEqual(643, offset)
> [...]
> 
> Hmm, again pretty fragile due to depending on Python internals.  Oh well.
> 
> -Andrew.

If you feel it would be less fragile, I can change that to use the
integers corresponding to hash('a') or hash('a', 'a'). I really wanted
to make sure things were working with StaticTuple, which uses the hash
algorithm from tuple itself.

This code originally used all StaticTuple instances, until I refactored
it for review. And as I mentioned, it originally wasn't going to allow
arbitrary objects.

However, since it now *does*, I could just use a bunch of ints. It still
depends on hash() details, which is python implementation specific... I
could write my own class that returns __hash__ that I want, but it seems
a bit YAGNI.

I'll simplify it, though, if it makes you feel more comfortable.

John
=:->

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (Cygwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkrPZx4ACgkQJdeBCYSNAAMJEwCfZDUETOQoPjySDGKLhoPRa9Sn
NMkAn2ttsmaFvkrJxzYSkzuuEUdW5bgh
=jPzP
-----END PGP SIGNATURE-----

Revision history for this message

Andrew Bennetts (spiv) wrote on 2009-10-12:

Download full text (12.2 KiB)

John A Meinel wrote:
[...]
> I would guess that doing
>
> add = dedup.add
> for bar in group:
> bar = add(bar)
>
> Is going to perform the same as the dedup[bar] form. But both should do
> better than setdefault. (At a minimum, setdefault takes 2 parameters,
> and thus has to create a tuple, etc.)
>
> I don't know if the interpreter has further internal benefits to using
> the __getitem__ form, since it is a known api that has to conform in
> certain ways.

Right, I think you'd still see a bit of benefit in the __getitem__ form because
Python knows that that operator only takes one arg and so avoids packing and
unpacking an args tuple for it. (Like METH_O vs. METH_VARARGS.) ISTR some part
of the zope.interface C extension intentionally abuses an obscure no-arg
operator (maybe pos?) as that's the fastest way to invoke C code from Python.

> The big concern is that "dedup[bar]" can mutate 'dedup' and that is
> potentially unexpected.

Agreed. For readability my preference is .add, but obviously sufficiently large
performance benefits may override that. I don't *think* the boost from using
__getitem__ instead would be that much greater for this use, but I haven't
measured so I may be very wrong :)

[...]
> How's this:
>
> # Data structure definition:

Great!

> >> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> >> + cdef long other_hash
> >> + cdef PyObject *res
> >> +
> >> + if this == other:
> >> + return 1
> >> + other_hash = Py_TYPE(other).tp_hash(other)
> >
> > What happens if 'other' is not hashable?
>
> Other will always be something that is already held in the internal
> structure, and thus has been hashed. 'this' has also already been hashed
> as part of inserting, and thus we've also checked that it can be hashed.
>
> I can change this to PyObject_Hash() if you feel strongly. I was
> avoiding a function call overhead, though it probably doesn't have a
> huge impact on performance.

Well, badly-behaved objects might successfully give a hash value the first time
and then get an error later. I do strongly lean towards paranoia in C code, so
I think PyObject_Hash is probably safest. You ought to be able to construct a
test that defines such a badly-behaved object to see just how bad the fallout
is. If segfaults are possible then definitely close that hole!

Also, PyObject_Hash is where Python implements the logic that an object with no
tp_hash in its type will use hash(id(obj)), so long as it doesn't have
tp_compare or tp_richcompare. I think this is pretty common, e.g. the module
type has no tp_hash. And given that you use the 'hash' global elsewhere in this
module there's definitely some chance for confusion here.

But for the case you're optimising for PyObject_Hash will just call the object's
tp_hash immediately and return the result, so it should be a minimal penalty,
just a C function call overhead.

The thing I was really looking for though was checking the result value of
tp_hash/PyObject_Hash — if its -1 then there's an error to be dealt with.

> >> + if other_hash != this_hash:
> >> + return 0
> >> + res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
>...

John A Meinel wrote:
[...]
> I would guess that doing
> 
> add = dedup.add
> for bar in group:
>   bar = add(bar)
> 
> Is going to perform the same as the dedup[bar] form. But both should do
> better than setdefault. (At a minimum, setdefault takes 2 parameters,
> and thus has to create a tuple, etc.)
> 
> I don't know if the interpreter has further internal benefits to using
> the __getitem__ form, since it is a known api that has to conform in
> certain ways.

Right, I think you'd still see a bit of benefit in the __getitem__ form because
Python knows that that operator only takes one arg and so avoids packing and
unpacking an args tuple for it.  (Like METH_O vs. METH_VARARGS.)  ISTR some part
of the zope.interface C extension intentionally abuses an obscure no-arg
operator (maybe pos?) as that's the fastest way to invoke C code from Python.

> The big concern is that "dedup[bar]" can mutate 'dedup' and that is
> potentially unexpected.

Agreed.  For readability my preference is .add, but obviously sufficiently large
performance benefits may override that.  I don't *think* the boost from using
__getitem__ instead would be that much greater for this use, but I haven't
measured so I may be very wrong :)

[...]
> How's this:
> 
>  # Data structure definition:

Great!

> >> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> >> +    cdef long other_hash
> >> +    cdef PyObject *res
> >> +
> >> +    if this == other:
> >> +        return 1
> >> +    other_hash = Py_TYPE(other).tp_hash(other)
> > 
> > What happens if 'other' is not hashable?
> 
> Other will always be something that is already held in the internal
> structure, and thus has been hashed. 'this' has also already been hashed
> as part of inserting, and thus we've also checked that it can be hashed.
> 
> I can change this to PyObject_Hash() if you feel strongly. I was
> avoiding a function call overhead, though it probably doesn't have a
> huge impact on performance.

Well, badly-behaved objects might successfully give a hash value the first time
and then get an error later.  I do strongly lean towards paranoia in C code, so
I think PyObject_Hash is probably safest.  You ought to be able to construct a
test that defines such a badly-behaved object to see just how bad the fallout
is.  If segfaults are possible then definitely close that hole!

Also, PyObject_Hash is where Python implements the logic that an object with no
tp_hash in its type will use hash(id(obj)), so long as it doesn't have
tp_compare or tp_richcompare.  I think this is pretty common, e.g. the module
type has no tp_hash.  And given that you use the 'hash' global elsewhere in this
module there's definitely some chance for confusion here.

But for the case you're optimising for PyObject_Hash will just call the object's
tp_hash immediately and return the result, so it should be a minimal penalty,
just a C function call overhead.

The thing I was really looking for though was checking the result value of
tp_hash/PyObject_Hash — if its -1 then there's an error to be dealt with.

> >> +    if other_hash != this_hash:
> >> +        return 0
> >> +    res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
> > 
> > Similarly, what if one the richcompare calls raise an exception?
> 
> I'll admit that I took a few shortcuts, as I was originally thinking
> that _table would be:
> 
>  StaticTuple **_table
> 
> rather than a generic object. So I can...
> 
> 1) Define _is_equal to 'except -1' and have Pyrex do exception checking
>    on every query. And then I can check 'if res == NULL: return -1'.

I think this is a must, and should be pretty cheap.

> 2) Arguably doing "res == Py_True" isn't strictly correct either, and I
>    should be using PyObject_IsTrue() here. However, the more generic we
>    make this code, the slower it performs, and we have to ask whether it
>    is worth being generic given that it is meant to be used with
>    specific types...

Well, if you want to restrict it to particular types, make the code reject them
from being added.  Otherwise someone will use this code in unexpected ways
sooner or later :)

Again, PyObject_IsTrue tries == Py_True as the very first thing, so the cost
should be small.  If you're worried a macro along the lines of:

#define PY_IS_TRUE(o) ((o == Py_True) || PyObject_IsTrue(o))

might make us both happy.

> 3) I would consider going so far as to restrict this class to only
>    supporting one type of object. (First object inserted sets the type,
>    or it is set via the constructor.)

Ah, I see you're thinking along the same lines as me.

> I'd like to avoid making the common case slow because of very unlikely
> potential edge cases. But I'm willing to discuss where an appropriate
> tradeoff is.

Sure.  I do think we'll be tempted to use things like this for more than your
original purpose if it works as well as it sounds like it will... so if it's
cheap to make it more widely applicable, or cheap to make it only work with
objects it's intended to work with, then we should do that.  If nothing else,
our tests have a tendency to substitute regular objects with strange test
doubles...

[...]
> The only bit that is close to copyrightable is:
> >> +            i = (i << 2) + i + perturb + 1
> >> +            entry = &table[i & mask]
> >> +            perturb >>= PERTURB_SHIFT
> 
> Since it is an explicit combination of parameters. I'm happy to just
> change it to something like:
>  http://en.wikipedia.org/wiki/Quadratic_probing
> 
> Which is: trivial to implement, just a different set of performance
> tradeoffs (you gain locality, at an increased chance for collisions),
> and certainly a public domain solution that would avoid any copyright
> issues.

I'd lean towards that then, simply because I'm not even close to qualified to
say what's safe to borrow.  If the performance is about the same then I don't
see any downside.

(Although, for a table of pointers, locality is probably less interesting, but
it may still help subsequent probes when there is a collision...)

[...]
> > It'd be slightly nicer to return Py_False/Py_True.  I guess you can easily do
> > that by doing return bool(self._discard(key))?  Or maybe with PyBool_FromLong...
> >
> 
> Originally I was using:
> 
> cpdef bint discard(self, key):
>   ...
> 
> Which is Cython code that defines both a Python version and a C version,
> and casts integers into Booleans. I can certainly just do:

That's a nice feature!  A good way to keep the C and Python interfaces
harmonised.

> if self._discard(key);
>   return True
> return False

I think you may as well; I doubt it will make a difference to performance and it
will make the interface feel just that tiny bit nicer.

> >> +cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
> >> +    """Find the slot where 'key' would fit.
> >> +
> >> +    This is the same as a dicts 'lookup' function.
> >> +
> >> +    :param key: An object we are looking up
> >> +    :param hash: The hash for key
> >> +    :return: The location in self.table where key should be put
> >> +        should never be NULL, but may reference a NULL (PyObject*)
> >> +    """
> > 
> > I don't quite follow the “should never be NULL, but may reference a NULL” part
> > of that docstring.
> > 
> 
> I shouldn't be using the term 'dict' here either.
> 
> Anyway, I'm returning a **. I'll try to clarify. But in essence:
> 
> PyObject **result = _lookup(...);
> 
> assert(result != NULL)
> if (*result) == NULL: # nothing is stored here

Oh, I see, because it's returning a pointer to a pointer.  I think I'd just say
“The location in self.table where should be put.  Will not be NULL.”  I think
it should go without saying that a particular slot may or may not be NULL.  If
you still want to mention it I'd say “(but the slot itself may point to NULL)”.

> > Hmm, this appears to be essentially:
> > 
> > <loop body>
> > while True:
> >     <increment stuff>
> >     <loop body>
> > 
> > Why can't it be:
> > 
> > while True:
> >     <loop body>
> >     <increment stuff>
> 
> So, there are a few subtle differences, but I'm not sure if it is worth
> the code duplication. Namely
> 
> 1) The first lookup is going to be the right location 90+% of the time.
> (if things are sized correctly, and your hash is working well, etc.) So
> you want to avoid any overhead that you can.
> 
> 2) One of that is checking whether free_slot is NULL or not.
> 
> 3) The loop code changes the order of checking for _dummy with checking
>    _is_equal, because in the claim is that _dummy rare enough to offset
>    the speed of the check with the speed of the comparison. I guess I'm
>    doubtful, too.

On the other hand maybe more compact code will fit better in the cache ;)

If the benefit is doubtful I'd go for simpler and shorter, myself.  If you'd
rather not bother, then at least put a comment there saying you've special-cased
the first iteration for performance because it should be all that's needed 90+%
of the time.

> > You don't need the semi-colon ;)
> > 
> > I wonder if there's any risk that an object referenced by a SimpleSet might
> > somehow mutate that SimpleSet during this traverse?
> 
> This is the same basic implementation that is used by dicts and sets.
> Namely, tp_traverse is used by the garbage collector, to 'doing stuff'
> is supposed to be forbidden (I assume).

Cool, that's what I was hoping.

> >> +    def assertRefcount(self, count, obj):
> >> +        """Assert that the refcount for obj is what we expect.
> >> +
> >> +        Note that this automatically adjusts for the fact that calling
> >> +        assertRefcount actually creates a new pointer, as does calling
> >> +        sys.getrefcount. So pass the expected value *before* the call.
> >> +        """
> >> +        # I'm not sure why the offset is 3, but I've check that in the caller,
> >> +        # an offset of 1 works, which is expected. Not sure why assertRefcount
> >> +        # is incrementing/decrementing 2 times
> >> +        self.assertEqual(count, sys.getrefcount(obj)-3)
> > 
> > Eep.  This is going to be pretty fragile across CPython versions, let alone
> > non-CPython Pythons...
> 
> Do you really think that refcounts are going to be fragile across
> CPython versions?

Well, probably not, but it wouldn't really surprise me either.

Another possible issue: this may break when run inside a debugger.

> > Can you perhaps use weakrefs instead?  They're enough to let you observe object
> > lifetimes (either via a callback, or just seeing that the weakref cannot be
> > resolved), although for portability you still need to call gc.collect() before
> > you can reliably know if the object is gone or not...
> 
> tuples can't have weakrefs. I really *do* want to assert that my
> refcounts are correct. And Pyrex code is *not* portable to anything but
> CPython.

Well, there are competing implementations of CPython these days, i.e. Unladen
Swallow, which is experimenting with garbage collection...

> > I guess we can worry about disabling the tests that need this on other Pythons
> > if/when we run on them...  (and of course we may not care about this code much
> > on non-CPython!).
> > 
> 
> Exactly. You can't write a C extension to Jython or IronPython...

There's work underway to make it possible to use CPython extensions in
IronPython: <http://www.resolversystems.com/news/?p=17>...

Anyway, I'm ok with asserting things about refcounts, just so long as it's kept
to a minimum :)

[...]
> > Hmm, again pretty fragile due to depending on Python internals.  Oh well.
> > 
> > -Andrew.
> 
> If you feel it would be less fragile, I can change that to use the
> integers corresponding to hash('a') or hash('a', 'a'). I really wanted
> to make sure things were working with StaticTuple, which uses the hash
> algorithm from tuple itself.
> 
> This code originally used all StaticTuple instances, until I refactored
> it for review. And as I mentioned, it originally wasn't going to allow
> arbitrary objects.
> 
> However, since it now *does*, I could just use a bunch of ints. It still
> depends on hash() details, which is python implementation specific... I
> could write my own class that returns __hash__ that I want, but it seems
> a bit YAGNI.
> 
> I'll simplify it, though, if it makes you feel more comfortable.

I'm ok with it as is.  I'm comfortable that we can easily enough change it later
if it does become a problem, and if it never does then even better ;)

-Andrew.

Revision history for this message

John A Meinel (jameinel) wrote on 2009-10-12:

Download full text (9.4 KiB)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew Bennetts wrote:
> John A Meinel wrote:
> [...]
>> I would guess that doing
>>
>> add = dedup.add
>> for bar in group:
>> bar = add(bar)
>>
>> Is going to perform the same as the dedup[bar] form. But both should do
>> better than setdefault. (At a minimum, setdefault takes 2 parameters,
>> and thus has to create a tuple, etc.)
>>
>> I don't know if the interpreter has further internal benefits to using
>> the __getitem__ form, since it is a known api that has to conform in
>> certain ways.
>
> Right, I think you'd still see a bit of benefit in the __getitem__ form because
> Python knows that that operator only takes one arg and so avoids packing and
> unpacking an args tuple for it. (Like METH_O vs. METH_VARARGS.) ISTR some part
> of the zope.interface C extension intentionally abuses an obscure no-arg
> operator (maybe pos?) as that's the fastest way to invoke C code from Python.
>
>> The big concern is that "dedup[bar]" can mutate 'dedup' and that is
>> potentially unexpected.
>
> Agreed. For readability my preference is .add, but obviously sufficiently large
> performance benefits may override that. I don't *think* the boost from using
> __getitem__ instead would be that much greater for this use, but I haven't
> measured so I may be very wrong :)
>
> [...]
>> How's this:
>>
>> # Data structure definition:
>
> Great!
>
>>>> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
>>>> + cdef long other_hash
>>>> + cdef PyObject *res
>>>> +
>>>> + if this == other:
>>>> + return 1
>>>> + other_hash = Py_TYPE(other).tp_hash(other)
>>> What happens if 'other' is not hashable?
>> Other will always be something that is already held in the internal
>> structure, and thus has been hashed. 'this' has also already been hashed
>> as part of inserting, and thus we've also checked that it can be hashed.
>>
>> I can change this to PyObject_Hash() if you feel strongly. I was
>> avoiding a function call overhead, though it probably doesn't have a
>> huge impact on performance.
>
> Well, badly-behaved objects might successfully give a hash value the first time
> and then get an error later. I do strongly lean towards paranoia in C code, so
> I think PyObject_Hash is probably safest. You ought to be able to construct a
> test that defines such a badly-behaved object to see just how bad the fallout
> is. If segfaults are possible then definitely close that hole!

No segfaults, but the potential to leave an error in the pipe. Which
causes random failures IIRC. (I forget the function where '-1' is maybe
an error, so you have to check PyErr_Occurred, etc.)

>
> Also, PyObject_Hash is where Python implements the logic that an object with no
> tp_hash in its type will use hash(id(obj)), so long as it doesn't have
> tp_compare or tp_richcompare. I think this is pretty common, e.g. the module
> type has no tp_hash. And given that you use the 'hash' global elsewhere in this
> module there's definitely some chance for confusion here.

So in my "_insert" I now have the assertion:
if (Py_TYPE(py_key).tp_richcompare == NULL
or Py_TYPE(py_key).tp_...

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk

Subscribers

People subscribed via source and target branches

to all changes:

Alejandro Cornejo2

Bazaar Codereview Subscribers

Benoit Pierre

Gmood

John A Meinel

Karl Bielefeldt

Mahmoud Hassan

Matt Nordhoff

Mohd Fikri Mohd Amin

MrJOHN

Václav Haisman

bzr PQM

vincenzo

to status/vote changes:

Alexander Belchenko

amandla2023

Bazaar

Merge lp:~jameinel/bzr/2.1-simple-set into lp:bzr

Commit message

Description of the change

Preview Diff

Subscribers

 === modified file '.bzrignore'
 --- .bzrignore	2009-09-09 11:43:10 +0000
 +++ .bzrignore	2009-10-12 16:51:14 +0000
@@ -58,6 +58,9 @@
  bzrlib/_known_graph_pyx.c
  bzrlib/_readdir_pyx.c
  bzrlib/_rio_pyx.c
++bzrlib/_simple_set_pyx.c
++bzrlib/_simple_set_pyx.h
++bzrlib/_simple_set_pyx_api.h
  bzrlib/_walkdirs_win32.c
  # built extension modules
  bzrlib/_*.dll
 === modified file 'NEWS'
 --- NEWS	2009-10-08 23:44:40 +0000
 +++ NEWS	2009-10-12 16:51:14 +0000
@@ -206,6 +206,13 @@
    repository or branch object is unlocked then relocked the same way.
    (Andrew Bennetts)
++* Added ``bzrlib._simple_set_pyx``. This is a hybrid between a Set and a
++  Dict (it only holds keys, but you can lookup the object located at a
++  given key). It has significantly reduced memory consumption versus the
++  builtin objects (1/2 the size of Set, 1/3rd the size of Dict). This will
++  be used as the interning structure for StaticTuple objects, as part of
++  an ongoing push to reduce peak memory consumption.  (John Arbash Meinel)
++
  * ``BTreeLeafParser.extract_key`` has been tweaked slightly to reduce
    mallocs while parsing the index (approx 3=>1 mallocs per key read).
    This results in a 10% speedup while reading an index.
 === added file 'bzrlib/_simple_set_pyx.pxd'
 --- bzrlib/_simple_set_pyx.pxd	1970-01-01 00:00:00 +0000
 +++ bzrlib/_simple_set_pyx.pxd	2009-10-12 16:51:14 +0000
@@ -0,0 +1,91 @@
++# Copyright (C) 2009 Canonical Ltd
++#
++# This program is free software; you can redistribute it and/or modify
++# it under the terms of the GNU General Public License as published by
++# the Free Software Foundation; either version 2 of the License, or
++# (at your option) any later version.
++#
++# This program is distributed in the hope that it will be useful,
++# but WITHOUT ANY WARRANTY; without even the implied warranty of
++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
++# GNU General Public License for more details.
++#
++# You should have received a copy of the GNU General Public License
++# along with this program; if not, write to the Free Software
++# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
++
++"""Interface definition of a class like PySet but without caching the hash.
++
++This is generally useful when you want to 'intern' objects, etc. Note that this
++differs from Set in that we:
++  1) Don't have all of the .intersection, .difference, etc functions
++  2) Do return the object from the set via queries
++     eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
++"""
++
++cdef extern from "Python.h":
++    ctypedef struct PyObject:
++        pass
++
++
++cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
++    """A class similar to PySet, but with simpler implementation.
++
++    The main advantage is that this class uses only 2N memory to store N
++    objects rather than 4N memory. The main trade-off is that we do not cache
++    the hash value of saved objects. As such, it is assumed that computing the
++    hash will be cheap (such as strings or tuples of strings, etc.)
++
++    This also differs in that you can get back the objects that are stored
++    (like a dict), but we also don't implement the complete list of 'set'
++    operations (difference, intersection, etc).
++    """
++    # Data structure definition:
++    #   This is a basic hash table using open addressing.
++    #       http://en.wikipedia.org/wiki/Open_addressing
++    #   Basically that means we keep an array of pointers to Python objects
++    #   (called a table). Each location in the array is called a 'slot'.
++    #
++    #   An empty slot holds a NULL pointer, a slot where there was an item
++    #   which was then deleted will hold a pointer to _dummy, and a filled slot
++    #   points at the actual object which fills that slot.
++    #
++    #   The table is always a power of two, and the default location where an
++    #   object is inserted is at hash(object) & (table_size - 1)
++    #
++    #   If there is a collision, then we search for another location. The
++    #   specific algorithm is in _lookup. We search until we:
++    #       find the object
++    #       find an equivalent object (by tp_richcompare(obj1, obj2, Py_EQ))
++    #       find a NULL slot
++    #
++    #   When an object is deleted, we set its slot to _dummy. this way we don't
++    #   have to track whether there was a collision, and find the corresponding
++    #   keys. (The collision resolution algorithm makes that nearly impossible
++    #   anyway, because it depends on the upper bits of the hash.)
++    #   The main effect of this, is that if we find _dummy, then we can insert
++    #   an object there, but we have to keep searching until we find NULL to
++    #   know that the object is not present elsewhere.
++
++    cdef Py_ssize_t _used   # active
++    cdef Py_ssize_t _fill   # active + dummy
++    cdef Py_ssize_t _mask   # Table contains (mask+1) slots, a power of 2
++    cdef PyObject **_table  # Pyrex/Cython doesn't support arrays to 'object'
++                            # so we manage it manually
++
++    cdef PyObject *_get(self, object key) except? NULL
++    cdef object _add(self, key)
++    cdef int _discard(self, key) except -1
++    cdef int _insert_clean(self, PyObject *key) except -1
++    cdef Py_ssize_t _resize(self, Py_ssize_t min_unused) except -1
++
++
++# TODO: might want to export the C api here, though it is all available from
++#       the class object...
++cdef api SimpleSet SimpleSet_New()
++cdef api object SimpleSet_Add(object self, object key)
++cdef api int SimpleSet_Contains(object self, object key) except -1
++cdef api int SimpleSet_Discard(object self, object key) except -1
++cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL
++cdef api Py_ssize_t SimpleSet_Size(object self) except -1
++cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key)
 === added file 'bzrlib/_simple_set_pyx.pyx'
 --- bzrlib/_simple_set_pyx.pyx	1970-01-01 00:00:00 +0000
 +++ bzrlib/_simple_set_pyx.pyx	2009-10-12 16:51:14 +0000
@@ -0,0 +1,600 @@
++# Copyright (C) 2009 Canonical Ltd
++#
++# This program is free software; you can redistribute it and/or modify
++# it under the terms of the GNU General Public License as published by
++# the Free Software Foundation; either version 2 of the License, or
++# (at your option) any later version.
++#
++# This program is distributed in the hope that it will be useful,
++# but WITHOUT ANY WARRANTY; without even the implied warranty of
++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
++# GNU General Public License for more details.
++#
++# You should have received a copy of the GNU General Public License
++# along with this program; if not, write to the Free Software
++# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
++
++"""Definition of a class that is similar to Set with some small changes."""
++
++cdef extern from "python-compat.h":
++    pass
++
++cdef extern from "Python.h":
++    ctypedef unsigned long size_t
++    ctypedef long (*hashfunc)(PyObject*)
++    ctypedef PyObject *(*richcmpfunc)(PyObject *, PyObject *, int)
++    ctypedef int (*visitproc)(PyObject *, void *)
++    ctypedef int (*traverseproc)(PyObject *, visitproc, void *)
++    int Py_EQ
++    PyObject *Py_True
++    PyObject *Py_NotImplemented
++    void Py_INCREF(PyObject *)
++    void Py_DECREF(PyObject *)
++    ctypedef struct PyTypeObject:
++        hashfunc tp_hash
++        richcmpfunc tp_richcompare
++        traverseproc tp_traverse
++
++    PyTypeObject *Py_TYPE(PyObject *)
++    int PyObject_IsTrue(PyObject *)
++
++    void *PyMem_Malloc(size_t nbytes)
++    void PyMem_Free(void *)
++    void memset(void *, int, size_t)
++
++
++# Dummy is an object used to mark nodes that have been deleted. Since
++# collisions require us to move a node to an alternative location, if we just
++# set an entry to NULL on delete, we won't find any relocated nodes.
++# We have to use _dummy_obj because we need to keep a refcount to it, but we
++# also use _dummy as a pointer, because it avoids having to put <PyObject*> all
++# over the code base.
++cdef object _dummy_obj
++cdef PyObject *_dummy
++_dummy_obj = object()
++_dummy = <PyObject *>_dummy_obj
++
++
++cdef int _is_equal(PyObject *this, long this_hash, PyObject *other) except -1:
++    cdef long other_hash
++    cdef PyObject *res
++
++    if this == other:
++        return 1
++    other_hash = Py_TYPE(other).tp_hash(other)
++    if other_hash == -1:
++        # Even though other successfully hashed in the past, it seems to have
++        # changed its mind, and failed this time, so propogate the failure.
++        return -1
++    if other_hash != this_hash:
++        return 0
++
++    # This implements a subset of the PyObject_RichCompareBool functionality.
++    # Namely it:
++    #   1) Doesn't try to do anything with old-style classes
++    #   2) Assumes that both objects have a tp_richcompare implementation, and
++    #      that if that is not enough to compare equal, then they are not
++    #      equal. (It doesn't try to cast them both to some intermediate form
++    #      that would compare equal.)
++    res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
++    if res == NULL: # Exception
++        return -1
++    if PyObject_IsTrue(res):
++        Py_DECREF(res)
++        return 1
++    if res == Py_NotImplemented:
++        Py_DECREF(res)
++        res = Py_TYPE(other).tp_richcompare(other, this, Py_EQ)
++    if res == NULL:
++        return -1
++    if PyObject_IsTrue(res):
++        Py_DECREF(res)
++        return 1
++    Py_DECREF(res)
++    return 0
++
++
++cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
++    """This class can be used to track canonical forms for objects.
++
++    It is similar in function to the interned dictionary that is used by
++    strings. However:
++
++      1) It assumes that hash(obj) is cheap, so does not need to inline a copy
++         of it
++      2) It only stores one reference to the object, rather than 2 (key vs
++         key:value)
++
++    As such, it uses 1/3rd the amount of memory to store a pointer to the
++    interned object.
++    """
++    # Attributes are defined in the .pxd file
++    DEF DEFAULT_SIZE=1024
++
++    def __init__(self):
++        cdef Py_ssize_t size, n_bytes
++
++        size = DEFAULT_SIZE
++        self._mask = size - 1
++        self._used = 0
++        self._fill = 0
++        n_bytes = sizeof(PyObject*) * size;
++        self._table = <PyObject **>PyMem_Malloc(n_bytes)
++        if self._table == NULL:
++            raise MemoryError()
++        memset(self._table, 0, n_bytes)
++
++    def __dealloc__(self):
++        if self._table != NULL:
++            PyMem_Free(self._table)
++            self._table = NULL
++
++    property used:
++        def __get__(self):
++            return self._used
++
++    property fill:
++        def __get__(self):
++            return self._fill
++
++    property mask:
++        def __get__(self):
++            return self._mask
++
++    def _memory_size(self):
++        """Return the number of bytes of memory consumed by this class."""
++        return sizeof(self) + (sizeof(PyObject*)*(self._mask + 1))
++
++    def __len__(self):
++        return self._used
++
++    def _test_lookup(self, key):
++        cdef PyObject **slot
++
++        slot = _lookup(self, key)
++        if slot[0] == NULL:
++            res = '<null>'
++        elif slot[0] == _dummy:
++            res = '<dummy>'
++        else:
++            res = <object>slot[0]
++        return <int>(slot - self._table), res
++
++    def __contains__(self, key):
++        """Is key present in this SimpleSet."""
++        cdef PyObject **slot
++
++        slot = _lookup(self, key)
++        if slot[0] == NULL or slot[0] == _dummy:
++            return False
++        return True
++
++    cdef PyObject *_get(self, object key) except? NULL:
++        """Return the object (or nothing) define at the given location."""
++        cdef PyObject **slot
++
++        slot = _lookup(self, key)
++        if slot[0] == NULL or slot[0] == _dummy:
++            return NULL
++        return slot[0]
++
++    def __getitem__(self, key):
++        """Return a stored item that is equivalent to key."""
++        cdef PyObject *py_val
++
++        py_val = self._get(key)
++        if py_val == NULL:
++            raise KeyError("Key %s is not present" % key)
++        val = <object>(py_val)
++        return val
++
++    cdef int _insert_clean(self, PyObject *key) except -1:
++        """Insert a key into self.table.
++
++        This is only meant to be used during times like '_resize',
++        as it makes a lot of assuptions about keys not already being present,
++        and there being no dummy entries.
++        """
++        cdef size_t i, n_lookup
++        cdef long the_hash
++        cdef PyObject **table, **slot
++        cdef Py_ssize_t mask
++
++        mask = self._mask
++        table = self._table
++
++        the_hash = Py_TYPE(key).tp_hash(key)
++        if the_hash == -1:
++            return -1
++        i = the_hash
++        for n_lookup from 0 <= n_lookup <= <size_t>mask: # Don't loop forever
++            slot = &table[i & mask]
++            if slot[0] == NULL:
++                slot[0] = key
++                self._fill = self._fill + 1
++                self._used = self._used + 1
++                return 1
++            i = i + 1 + n_lookup
++        raise RuntimeError('ran out of slots.')
++
++    def _py_resize(self, min_used):
++        """Do not use this directly, it is only exposed for testing."""
++        return self._resize(min_used)
++
++    cdef Py_ssize_t _resize(self, Py_ssize_t min_used) except -1:
++        """Resize the internal table.
++
++        The final table will be big enough to hold at least min_used entries.
++        We will copy the data from the existing table over, leaving out dummy
++        entries.
++
++        :return: The new size of the internal table
++        """
++        cdef Py_ssize_t new_size, n_bytes, remaining
++        cdef PyObject **new_table, **old_table, **slot
++
++        new_size = DEFAULT_SIZE
++        while new_size <= min_used and new_size > 0:
++            new_size = new_size << 1
++        # We rolled over our signed size field
++        if new_size <= 0:
++            raise MemoryError()
++        # Even if min_used == self._mask + 1, and we aren't changing the actual
++        # size, we will still run the algorithm so that dummy entries are
++        # removed
++        # TODO: Test this
++        # if new_size < self._used:
++        #     raise RuntimeError('cannot shrink SimpleSet to something'
++        #                        ' smaller than the number of used slots.')
++        n_bytes = sizeof(PyObject*) * new_size;
++        new_table = <PyObject **>PyMem_Malloc(n_bytes)
++        if new_table == NULL:
++            raise MemoryError()
++
++        old_table = self._table
++        self._table = new_table
++        memset(self._table, 0, n_bytes)
++        self._mask = new_size - 1
++        self._used = 0
++        remaining = self._fill
++        self._fill = 0
++
++        # Moving everything to the other table is refcount neutral, so we don't
++        # worry about it.
++        slot = old_table
++        while remaining > 0:
++            if slot[0] == NULL: # unused slot
++                pass
++            elif slot[0] == _dummy: # dummy slot
++                remaining = remaining - 1
++            else: # active slot
++                remaining = remaining - 1
++                self._insert_clean(slot[0])
++            slot = slot + 1
++        PyMem_Free(old_table)
++        return new_size
++
++    def add(self, key):
++        """Similar to set.add(), start tracking this key.
++
++        There is one small difference, which is that we return the object that
++        is stored at the given location. (which is closer to the
++        dict.setdefault() functionality.)
++        """
++        return self._add(key)
++
++    cdef object _add(self, key):
++        cdef PyObject **slot, *py_key
++        cdef int added
++
++        py_key = <PyObject *>key
++        if (Py_TYPE(py_key).tp_richcompare == NULL
++            or Py_TYPE(py_key).tp_hash == NULL):
++            raise TypeError('Types added to SimpleSet must implement'
++                            ' both tp_richcompare and tp_hash')
++        added = 0
++        # We need at least one empty slot
++        assert self._used < self._mask
++        slot = _lookup(self, key)
++        if (slot[0] == NULL):
++            Py_INCREF(py_key)
++            self._fill = self._fill + 1
++            self._used = self._used + 1
++            slot[0] = py_key
++            added = 1
++        elif (slot[0] == _dummy):
++            Py_INCREF(py_key)
++            self._used = self._used + 1
++            slot[0] = py_key
++            added = 1
++        # No else: clause. If _lookup returns a pointer to
++        # a live object, then we already have a value at this location.
++        retval = <object>(slot[0])
++        # PySet and PyDict use a 2-3rds full algorithm, we'll follow suit
++        if added and (self._fill * 3) >= ((self._mask + 1) * 2):
++            # However, we always work for a load factor of 2:1
++            self._resize(self._used * 2)
++        # Even if we resized and ended up moving retval into a different slot,
++        # it is still the value that is held at the slot equivalent to 'key',
++        # so we can still return it
++        return retval
++
++    def discard(self, key):
++        """Remove key from the set, whether it exists or not.
++
++        :return: False if the item did not exist, True if it did
++        """
++        if self._discard(key):
++            return True
++        return False
++
++    cdef int _discard(self, key) except -1:
++        cdef PyObject **slot, *py_key
++
++        slot = _lookup(self, key)
++        if slot[0] == NULL or slot[0] == _dummy:
++            return 0
++        self._used = self._used - 1
++        Py_DECREF(slot[0])
++        slot[0] = _dummy
++        # PySet uses the heuristic: If more than 1/5 are dummies, then resize
++        #                           them away
++        #   if ((so->_fill - so->_used) * 5 < so->mask)
++        # However, we are planning on using this as an interning structure, in
++        # which we will be putting a lot of objects. And we expect that large
++        # groups of them are going to have the same lifetime.
++        # Dummy entries hurt a little bit because they cause the lookup to keep
++        # searching, but resizing is also rather expensive
++        # For now, we'll just use their algorithm, but we may want to revisit
++        # it
++        if ((self._fill - self._used) * 5 > self._mask):
++            self._resize(self._used * 2)
++        return 1
++
++    def __iter__(self):
++        return _SimpleSet_iterator(self)
++
++
++cdef class _SimpleSet_iterator:
++    """Iterator over the SimpleSet structure."""
++
++    cdef Py_ssize_t pos
++    cdef SimpleSet set
++    cdef Py_ssize_t _used # track if things have been mutated while iterating
++    cdef Py_ssize_t len # number of entries left
++
++    def __init__(self, obj):
++        self.set = obj
++        self.pos = 0
++        self._used = self.set._used
++        self.len = self.set._used
++
++    def __iter__(self):
++        return self
++
++    def __next__(self):
++        cdef Py_ssize_t mask, i
++        cdef PyObject *key
++
++        if self.set is None:
++            raise StopIteration
++        if self.set._used != self._used:
++            # Force this exception to continue to be raised
++            self._used = -1
++            raise RuntimeError("Set size changed during iteration")
++        if not SimpleSet_Next(self.set, &self.pos, &key):
++            self.set = None
++            raise StopIteration
++        # we found something
++        the_key = <object>key # INCREF
++        self.len = self.len - 1
++        return the_key
++
++    def __length_hint__(self):
++        if self.set is not None and self._used == self.set._used:
++            return self.len
++        return 0
++
++
++
++cdef api SimpleSet SimpleSet_New():
++    """Create a new SimpleSet object."""
++    return SimpleSet()
++
++
++cdef SimpleSet _check_self(object self):
++    """Check that the parameter is not None.
++
++    Pyrex/Cython will do type checking, but only to ensure that an object is
++    either the right type or None. You can say "object foo not None" for pure
++    python functions, but not for C functions.
++    So this is just a helper for all the apis that need to do the check.
++    """
++    cdef SimpleSet true_self
++    if self is None:
++        raise TypeError('self must not be None')
++    true_self = self
++    return true_self
++
++
++cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
++    """Find the slot where 'key' would fit.
++
++    This is the same as a dicts 'lookup' function.
++
++    :param key: An object we are looking up
++    :param hash: The hash for key
++    :return: The location in self.table where key should be put.
++        location == NULL is an exception, but (*location) == NULL just
++        indicates the slot is empty and can be used.
++    """
++    # This uses Quadratic Probing:
++    #  http://en.wikipedia.org/wiki/Quadratic_probing
++    # with c1 = c2 = 1/2
++    # This leads to probe locations at:
++    #  h0 = hash(k1)
++    #  h1 = h0 + 1
++    #  h2 = h0 + 3 = h1 + 1 + 1
++    #  h3 = h0 + 6 = h2 + 1 + 2
++    #  h4 = h0 + 10 = h2 + 1 + 3
++    # Note that all of these are '& mask', but that is computed *after* the
++    # offset.
++    # This differs from the algorithm used by Set and Dict. Which, effectively,
++    # use double-hashing, and a step size that starts large, but dwindles to
++    # stepping one-by-one.
++    # This gives more 'locality' in that if you have a collision at offset X,
++    # the first fallback is X+1, which is fast to check. However, that means
++    # that an object w/ hash X+1 will also check there, and then X+2 next.
++    # However, for objects with differing hashes, their chains are different.
++    # The former checks X, X+1, X+3, ... the latter checks X+1, X+2, X+4, ...
++    # So different hashes diverge quickly.
++    # A bigger problem is that we *only* ever use the lowest bits of the hash
++    # So all integers (x + SIZE*N) will resolve into the same bucket, and all
++    # use the same collision resolution. We may want to try to find a way to
++    # incorporate the upper bits of the hash with quadratic probing. (For
++    # example, X, X+1, X+3+some_upper_bits, X+6+more_upper_bits, etc.)
++    cdef size_t i, n_lookup
++    cdef Py_ssize_t mask
++    cdef long key_hash
++    cdef PyObject **table, **slot, *cur, **free_slot, *py_key
++
++    # hash is a signed long(), we are using an offset at unsigned size_t
++    key_hash = hash(key)
++    i = <size_t>key_hash
++    mask = self._mask
++    table = self._table
++    free_slot = NULL
++    py_key = <PyObject *>key
++    for n_lookup from 0 <= n_lookup <= <size_t>mask: # Don't loop forever
++        slot = &table[i & mask]
++        cur = slot[0]
++        if cur == NULL:
++            # Found a blank spot
++            if free_slot != NULL:
++                # Did we find an earlier _dummy entry?
++                return free_slot
++            else:
++                return slot
++        if cur == py_key:
++            # Found an exact pointer to the key
++            return slot
++        if cur == _dummy:
++            if free_slot == NULL:
++                free_slot = slot
++        elif _is_equal(py_key, key_hash, cur):
++            # Both py_key and cur belong in this slot, return it
++            return slot
++        i = i + 1 + n_lookup
++    raise AssertionError('should never get here')
++
++
++cdef api PyObject **_SimpleSet_Lookup(object self, object key) except NULL:
++    """Find the slot where 'key' would fit.
++
++    This is the same as a dicts 'lookup' function. This is a private
++    api because mutating what you get without maintaing the other invariants
++    is a 'bad thing'.
++
++    :param key: An object we are looking up
++    :param hash: The hash for key
++    :return: The location in self._table where key should be put
++        should never be NULL, but may reference a NULL (PyObject*)
++    """
++    return _lookup(_check_self(self), key)
++
++
++cdef api object SimpleSet_Add(object self, object key):
++    """Add a key to the SimpleSet (set).
++
++    :param self: The SimpleSet to add the key to.
++    :param key: The key to be added. If the key is already present,
++        self will not be modified
++    :return: The current key stored at the location defined by 'key'.
++        This may be the same object, or it may be an equivalent object.
++        (consider dict.setdefault(key, key))
++    """
++    return _check_self(self)._add(key)
++
++
++cdef api int SimpleSet_Contains(object self, object key) except -1:
++    """Is key present in self?"""
++    return (key in _check_self(self))
++
++
++cdef api int SimpleSet_Discard(object self, object key) except -1:
++    """Remove the object referenced at location 'key'.
++
++    :param self: The SimpleSet being modified
++    :param key: The key we are checking on
++    :return: 1 if there was an object present, 0 if there was not, and -1 on
++        error.
++    """
++    return _check_self(self)._discard(key)
++
++
++cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL:
++    """Get a pointer to the object present at location 'key'.
++
++    This returns an object which is equal to key which was previously added to
++    self. This returns a borrowed reference, as it may also return NULL if no
++    value is present at that location.
++
++    :param key: The value we are looking for
++    :return: The object present at that location
++    """
++    return _check_self(self)._get(key)
++
++
++cdef api Py_ssize_t SimpleSet_Size(object self) except -1:
++    """Get the number of active entries in 'self'"""
++    return _check_self(self)._used
++
++
++cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key):
++    """Walk over items in a SimpleSet.
++
++    :param pos: should be initialized to 0 by the caller, and will be updated
++        by this function
++    :param key: Will return a borrowed reference to key
++    :return: 0 if nothing left, 1 if we are returning a new value
++    """
++    cdef Py_ssize_t i, mask
++    cdef SimpleSet true_self
++    cdef PyObject **table
++    true_self = _check_self(self)
++    i = pos[0]
++    if (i < 0):
++        return 0
++    mask = true_self._mask
++    table= true_self._table
++    while (i <= mask and (table[i] == NULL or table[i] == _dummy)):
++        i = i + 1
++    pos[0] = i + 1
++    if (i > mask):
++        return 0 # All done
++    if (key != NULL):
++        key[0] = table[i]
++    return 1
++
++
++cdef int SimpleSet_traverse(SimpleSet self, visitproc visit, void *arg):
++    """This is an implementation of 'tp_traverse' that hits the whole table.
++
++    Cython/Pyrex don't seem to let you define a tp_traverse, and they only
++    define one for you if you have an 'object' attribute. Since they don't
++    support C arrays of objects, we access the PyObject * directly.
++    """
++    cdef Py_ssize_t pos
++    cdef PyObject *next_key
++    cdef int ret
++
++    pos = 0
++    while SimpleSet_Next(self, &pos, &next_key):
++        ret = visit(next_key, arg)
++        if ret:
++            return ret
++    return 0
++
++# It is a little bit ugly to do this, but it works, and means that Meliae can
++# dump the total memory consumed by all child objects.
++(<PyTypeObject *>SimpleSet).tp_traverse = <traverseproc>SimpleSet_traverse
 === modified file 'bzrlib/python-compat.h'
 --- bzrlib/python-compat.h	2009-06-10 03:56:49 +0000
 +++ bzrlib/python-compat.h	2009-10-12 16:51:14 +0000
@@ -73,4 +73,9 @@
  #define  snprintf  _snprintf
  #endif
++/* Introduced in Python 2.6 */
++#ifndef Py_TYPE
++#  define Py_TYPE(o) ((o)->ob_type)
++#endif
++
  #endif /* _BZR_PYTHON_COMPAT_H */
 === modified file 'bzrlib/tests/__init__.py'
 --- bzrlib/tests/__init__.py	2009-10-08 01:50:30 +0000
 +++ bzrlib/tests/__init__.py	2009-10-12 16:51:14 +0000
@@ -3688,6 +3688,7 @@
          'bzrlib.tests.test__groupcompress',
          'bzrlib.tests.test__known_graph',
          'bzrlib.tests.test__rio',
++        'bzrlib.tests.test__simple_set',
          'bzrlib.tests.test__walkdirs_win32',
          'bzrlib.tests.test_ancestry',
          'bzrlib.tests.test_annotate',
 === added file 'bzrlib/tests/test__simple_set.py'
 --- bzrlib/tests/test__simple_set.py	1970-01-01 00:00:00 +0000
 +++ bzrlib/tests/test__simple_set.py	2009-10-12 16:51:14 +0000
@@ -0,0 +1,371 @@
++# Copyright (C) 2009 Canonical Ltd
++#
++# This program is free software; you can redistribute it and/or modify
++# it under the terms of the GNU General Public License as published by
++# the Free Software Foundation; either version 2 of the License, or
++# (at your option) any later version.
++#
++# This program is distributed in the hope that it will be useful,
++# but WITHOUT ANY WARRANTY; without even the implied warranty of
++# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
++# GNU General Public License for more details.
++#
++# You should have received a copy of the GNU General Public License
++# along with this program; if not, write to the Free Software
++# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
++
++"""Tests for the StaticTupleInterned type."""
++
++import sys
++
++from bzrlib import (
++    errors,
++    osutils,
++    tests,
++    )
++
++try:
++    from bzrlib import _simple_set_pyx
++except ImportError:
++    _simple_set_pyx = None
++
++
++class _Hashable(object):
++    """A simple object which has a fixed hash value.
++
++    We could have used an 'int', but it turns out that Int objects don't
++    implement tp_richcompare...
++    """
++
++    def __init__(self, the_hash):
++        self.hash = the_hash
++
++    def __hash__(self):
++        return self.hash
++
++    def __eq__(self, other):
++        if not isinstance(other, _Hashable):
++            return NotImplemented
++        return other.hash == self.hash
++
++
++class _BadSecondHash(_Hashable):
++
++    def __init__(self, the_hash):
++        _Hashable.__init__(self, the_hash)
++        self._first = True
++
++    def __hash__(self):
++        if self._first:
++            self._first = False
++            return self.hash
++        else:
++            raise ValueError('I can only be hashed once.')
++
++
++class _BadCompare(_Hashable):
++
++    def __eq__(self, other):
++        raise RuntimeError('I refuse to play nice')
++
++
++# Even though this is an extension, we don't permute the tests for a python
++# version. As the plain python version is just a dict or set
++
++class _CompiledSimpleSet(tests.Feature):
++
++    def _probe(self):
++        if _simple_set_pyx is None:
++            return False
++        return True
++
++    def feature_name(self):
++        return 'bzrlib._simple_set_pyx'
++
++CompiledSimpleSet = _CompiledSimpleSet()
++
++
++class TestSimpleSet(tests.TestCase):
++
++    _test_needs_features = [CompiledSimpleSet]
++    module = _simple_set_pyx
++
++    def assertIn(self, obj, container):
++        self.assertTrue(obj in container,
++            '%s not found in %s' % (obj, container))
++
++    def assertNotIn(self, obj, container):
++        self.assertTrue(obj not in container,
++            'We found %s in %s' % (obj, container))
++
++    def assertFillState(self, used, fill, mask, obj):
++        self.assertEqual((used, fill, mask), (obj.used, obj.fill, obj.mask))
++
++    def assertLookup(self, offset, value, obj, key):
++        self.assertEqual((offset, value), obj._test_lookup(key))
++
++    def assertRefcount(self, count, obj):
++        """Assert that the refcount for obj is what we expect.
++
++        Note that this automatically adjusts for the fact that calling
++        assertRefcount actually creates a new pointer, as does calling
++        sys.getrefcount. So pass the expected value *before* the call.
++        """
++        # I'm not sure why the offset is 3, but I've check that in the caller,
++        # an offset of 1 works, which is expected. Not sure why assertRefcount
++        # is incrementing/decrementing 2 times
++        self.assertEqual(count, sys.getrefcount(obj)-3)
++
++    def test_initial(self):
++        obj = self.module.SimpleSet()
++        self.assertEqual(0, len(obj))
++        st = ('foo', 'bar')
++        self.assertFillState(0, 0, 0x3ff, obj)
++
++    def test__lookup(self):
++        # These are carefully chosen integers to force hash collisions in the
++        # algorithm, based on the initial set size of 1024
++        obj = self.module.SimpleSet()
++        self.assertLookup(643, '<null>', obj, _Hashable(643))
++        self.assertLookup(643, '<null>', obj, _Hashable(643 + 1024))
++        self.assertLookup(643, '<null>', obj, _Hashable(643 + 50*1024))
++
++    def test__lookup_collision(self):
++        obj = self.module.SimpleSet()
++        k1 = _Hashable(643)
++        k2 = _Hashable(643 + 1024)
++        self.assertLookup(643, '<null>', obj, k1)
++        self.assertLookup(643, '<null>', obj, k2)
++        obj.add(k1)
++        self.assertLookup(643, k1, obj, k1)
++        self.assertLookup(644, '<null>', obj, k2)
++
++    def test__lookup_after_resize(self):
++        obj = self.module.SimpleSet()
++        k1 = _Hashable(643)
++        k2 = _Hashable(643 + 1024)
++        obj.add(k1)
++        obj.add(k2)
++        self.assertLookup(643, k1, obj, k1)
++        self.assertLookup(644, k2, obj, k2)
++        obj._py_resize(2047) # resized to 2048
++        self.assertEqual(2048, obj.mask + 1)
++        self.assertLookup(643, k1, obj, k1)
++        self.assertLookup(643+1024, k2, obj, k2)
++        obj._py_resize(1023) # resized back to 1024
++        self.assertEqual(1024, obj.mask + 1)
++        self.assertLookup(643, k1, obj, k1)
++        self.assertLookup(644, k2, obj, k2)
++
++    def test_get_set_del_with_collisions(self):
++        obj = self.module.SimpleSet()
++
++        h1 = 643
++        h2 = 643 + 1024
++        h3 = 643 + 1024*50
++        h4 = 643 + 1024*25
++        h5 = 644
++        h6 = 644 + 1024
++
++        k1 = _Hashable(h1)
++        k2 = _Hashable(h2)
++        k3 = _Hashable(h3)
++        k4 = _Hashable(h4)
++        k5 = _Hashable(h5)
++        k6 = _Hashable(h6)
++        self.assertLookup(643, '<null>', obj, k1)
++        self.assertLookup(643, '<null>', obj, k2)
++        self.assertLookup(643, '<null>', obj, k3)
++        self.assertLookup(643, '<null>', obj, k4)
++        self.assertLookup(644, '<null>', obj, k5)
++        self.assertLookup(644, '<null>', obj, k6)
++        obj.add(k1)
++        self.assertIn(k1, obj)
++        self.assertNotIn(k2, obj)
++        self.assertNotIn(k3, obj)
++        self.assertNotIn(k4, obj)
++        self.assertLookup(643, k1, obj, k1)
++        self.assertLookup(644, '<null>', obj, k2)
++        self.assertLookup(644, '<null>', obj, k3)
++        self.assertLookup(644, '<null>', obj, k4)
++        self.assertLookup(644, '<null>', obj, k5)
++        self.assertLookup(644, '<null>', obj, k6)
++        self.assertIs(k1, obj[k1])
++        self.assertIs(k2, obj.add(k2))
++        self.assertIs(k2, obj[k2])
++        self.assertLookup(643, k1, obj, k1)
++        self.assertLookup(644, k2, obj, k2)
++        self.assertLookup(646, '<null>', obj, k3)
++        self.assertLookup(646, '<null>', obj, k4)
++        self.assertLookup(645, '<null>', obj, k5)
++        self.assertLookup(645, '<null>', obj, k6)
++        self.assertLookup(643, k1, obj, _Hashable(h1))
++        self.assertLookup(644, k2, obj, _Hashable(h2))
++        self.assertLookup(646, '<null>', obj, _Hashable(h3))
++        self.assertLookup(646, '<null>', obj, _Hashable(h4))
++        self.assertLookup(645, '<null>', obj, _Hashable(h5))
++        self.assertLookup(645, '<null>', obj, _Hashable(h6))
++        obj.add(k3)
++        self.assertIs(k3, obj[k3])
++        self.assertIn(k1, obj)
++        self.assertIn(k2, obj)
++        self.assertIn(k3, obj)
++        self.assertNotIn(k4, obj)
++
++        obj.discard(k1)
++        self.assertLookup(643, '<dummy>', obj, k1)
++        self.assertLookup(644, k2, obj, k2)
++        self.assertLookup(646, k3, obj, k3)
++        self.assertLookup(643, '<dummy>', obj, k4)
++        self.assertNotIn(k1, obj)
++        self.assertIn(k2, obj)
++        self.assertIn(k3, obj)
++        self.assertNotIn(k4, obj)
++
++    def test_add(self):
++        obj = self.module.SimpleSet()
++        self.assertFillState(0, 0, 0x3ff, obj)
++        # We use this clumsy notation, because otherwise the refcounts are off.
++        # I'm guessing the python compiler sees it is a static tuple, and adds
++        # it to the function variables, or somesuch
++        k1 = tuple(['foo'])
++        self.assertRefcount(1, k1)
++        self.assertIs(k1, obj.add(k1))
++        self.assertFillState(1, 1, 0x3ff, obj)
++        self.assertRefcount(2, k1)
++        ktest = obj[k1]
++        self.assertRefcount(3, k1)
++        self.assertIs(k1, ktest)
++        del ktest
++        self.assertRefcount(2, k1)
++        k2 = tuple(['foo'])
++        self.assertRefcount(1, k2)
++        self.assertIsNot(k1, k2)
++        # doesn't add anything, so the counters shouldn't be adjusted
++        self.assertIs(k1, obj.add(k2))
++        self.assertFillState(1, 1, 0x3ff, obj)
++        self.assertRefcount(2, k1) # not changed
++        self.assertRefcount(1, k2) # not incremented
++        self.assertIs(k1, obj[k1])
++        self.assertIs(k1, obj[k2])
++        self.assertRefcount(2, k1)
++        self.assertRefcount(1, k2)
++        # Deleting an entry should remove the fill, but not the used
++        obj.discard(k1)
++        self.assertFillState(0, 1, 0x3ff, obj)
++        self.assertRefcount(1, k1)
++        k3 = tuple(['bar'])
++        self.assertRefcount(1, k3)
++        self.assertIs(k3, obj.add(k3))
++        self.assertFillState(1, 2, 0x3ff, obj)
++        self.assertRefcount(2, k3)
++        self.assertIs(k2, obj.add(k2))
++        self.assertFillState(2, 2, 0x3ff, obj)
++        self.assertRefcount(1, k1)
++        self.assertRefcount(2, k2)
++        self.assertRefcount(2, k3)
++
++    def test_discard(self):
++        obj = self.module.SimpleSet()
++        k1 = tuple(['foo'])
++        k2 = tuple(['foo'])
++        k3 = tuple(['bar'])
++        self.assertRefcount(1, k1)
++        self.assertRefcount(1, k2)
++        self.assertRefcount(1, k3)
++        obj.add(k1)
++        self.assertRefcount(2, k1)
++        self.assertEqual(0, obj.discard(k3))
++        self.assertRefcount(1, k3)
++        obj.add(k3)
++        self.assertRefcount(2, k3)
++        self.assertEqual(1, obj.discard(k3))
++        self.assertRefcount(1, k3)
++
++    def test__resize(self):
++        obj = self.module.SimpleSet()
++        k1 = ('foo',)
++        k2 = ('bar',)
++        k3 = ('baz',)
++        obj.add(k1)
++        obj.add(k2)
++        obj.add(k3)
++        obj.discard(k2)
++        self.assertFillState(2, 3, 0x3ff, obj)
++        self.assertEqual(1024, obj._py_resize(500))
++        # Doesn't change the size, but does change the content
++        self.assertFillState(2, 2, 0x3ff, obj)
++        obj.add(k2)
++        obj.discard(k3)
++        self.assertFillState(2, 3, 0x3ff, obj)
++        self.assertEqual(4096, obj._py_resize(4095))
++        self.assertFillState(2, 2, 0xfff, obj)
++        self.assertIn(k1, obj)
++        self.assertIn(k2, obj)
++        self.assertNotIn(k3, obj)
++        obj.add(k2)
++        self.assertIn(k2, obj)
++        obj.discard(k2)
++        self.assertEqual((591, '<dummy>'), obj._test_lookup(k2))
++        self.assertFillState(1, 2, 0xfff, obj)
++        self.assertEqual(2048, obj._py_resize(1024))
++        self.assertFillState(1, 1, 0x7ff, obj)
++        self.assertEqual((591, '<null>'), obj._test_lookup(k2))
++
++    def test_second_hash_failure(self):
++        obj = self.module.SimpleSet()
++        k1 = _BadSecondHash(200)
++        k2 = _Hashable(200)
++        # Should only call hash() one time
++        obj.add(k1)
++        self.assertFalse(k1._first)
++        self.assertRaises(ValueError, obj.add, k2)
++
++    def test_richcompare_failure(self):
++        obj = self.module.SimpleSet()
++        k1 = _Hashable(200)
++        k2 = _BadCompare(200)
++        obj.add(k1)
++        # Tries to compare with k1, fails
++        self.assertRaises(RuntimeError, obj.add, k2)
++
++    def test_add_and_remove_lots_of_items(self):
++        obj = self.module.SimpleSet()
++        chars = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890'
++        for i in chars:
++            for j in chars:
++                k = (i, j)
++                obj.add(k)
++        num = len(chars)*len(chars)
++        self.assertFillState(num, num, 0x1fff, obj)
++        # Now delete all of the entries and it should shrink again
++        for i in chars:
++            for j in chars:
++                k = (i, j)
++                obj.discard(k)
++        # It should be back to 1024 wide mask, though there may still be some
++        # dummy values in there
++        self.assertFillState(0, obj.fill, 0x3ff, obj)
++        # but there should be fewer than 1/5th dummy entries
++        self.assertTrue(obj.fill < 1024 / 5)
++
++    def test__iter__(self):
++        obj = self.module.SimpleSet()
++        k1 = ('1',)
++        k2 = ('1', '2')
++        k3 = ('3', '4')
++        obj.add(k1)
++        obj.add(k2)
++        obj.add(k3)
++        all = set()
++        for key in obj:
++            all.add(key)
++        self.assertEqual(sorted([k1, k2, k3]), sorted(all))
++        iterator = iter(obj)
++        iterator.next()
++        obj.add(('foo',))
++        # Set changed size
++        self.assertRaises(RuntimeError, iterator.next)
++        # And even removing an item still causes it to fail
++        obj.discard(k2)
++        self.assertRaises(RuntimeError, iterator.next)
 === modified file 'setup.py'
 --- setup.py	2009-10-01 03:46:41 +0000
 +++ setup.py	2009-10-12 16:51:14 +0000
@@ -300,6 +300,7 @@
  add_pyrex_extension('bzrlib._chk_map_pyx', libraries=[z_lib])
  ext_modules.append(Extension('bzrlib._patiencediff_c',
                               ['bzrlib/_patiencediff_c.c']))
++add_pyrex_extension('bzrlib._simple_set_pyx')
  if unavailable_files: