I'm very excited by this patch series!  This one is:

review approve

Although there are some comments you should look at, especially regarding
licensing...

John A Meinel wrote:
[...]
> === added file 'bzrlib/_simple_set_pyx.pxd'
[...]
> +"""Interface definition of a class like PySet but without caching the hash.
> +
> +This is generally useful when you want to 'intern' objects, etc. Note that this
> +differs from Set in that we:
> +  1) Don't have all of the .intersection, .difference, etc functions
> +  2) Do return the object from the set via queries
> +     eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
> +"""

I feel a bit unsure about a type that has both add and
__getitem__/__delitem__.  It's a slightly unusual mix of dict-like and
set-like APIs.  I think it's fine after thinking about it a bit; I'm just
noting that it seems a bit funny at first.  Although maybe .remove would be
more set-like than .__delitem__?

> +cdef api object SimpleSet_Add(object self, object key)
> +cdef api SimpleSet SimpleSet_New()
> +cdef api object SimpleSet_Add(object self, object key)

That appears to be a duplicate declaration of SimpleSet_Add.

> +cdef api int SimpleSet_Contains(object self, object key) except -1
> +cdef api int SimpleSet_Discard(object self, object key) except -1
> +cdef api PyObject *SimpleSet_Get(SimpleSet self, object key) except? NULL
> +cdef api Py_ssize_t SimpleSet_Size(object self) except -1
> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key)
>
> === added file 'bzrlib/_simple_set_pyx.pyx'
[...]
> +cdef object _dummy_obj
> +cdef PyObject *_dummy
> +_dummy_obj = object()
> +_dummy = _dummy_obj

It's not very clear what _dummy is used for.  I guess what's missing is
some text giving an overview of the data structure, although I suppose
that's what you mean to convey by the docstrings saying it is similar to
the builtin dict/set types.  Anyway, it appears _dummy is used to avoid
resizing/compacting the table for every single discard.

[Part of why it would be nice to have some text describing the data
structure is so that there's some clear terminology to use... the code
pretty clearly talks about “tables” and “slots”, which seem fairly clear,
but then what's an “entry”?]

> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> +    cdef long other_hash
> +    cdef PyObject *res
> +
> +    if this == other:
> +        return 1
> +    other_hash = Py_TYPE(other).tp_hash(other)

What happens if 'other' is not hashable?

> +    if other_hash != this_hash:
> +        return 0
> +    res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)

Similarly, what if one of the richcompare calls raises an exception?

> +cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
[...]
> +    cdef int _insert_clean(self, PyObject *key) except -1:
> +        """Insert a key into self.table.
> +
> +        This is only meant to be used during times like '_resize',
> +        as it makes a lot of assuptions about keys not already being present,
> +        and there being no dummy entries.
> +        """
> +        cdef size_t i, perturb, mask
> +        cdef long the_hash
> +        cdef PyObject **table, **entry
> +
> +        mask = self._mask
> +        table = self._table
> +
> +        the_hash = Py_TYPE(key).tp_hash(key)
> +        i = the_hash & mask
> +        entry = &table[i]
> +        perturb = the_hash
> +        # Because we know that we made all items unique before, we can just
> +        # iterate as long as the target location is not empty, we don't have to
> +        # do any comparison, etc.
> +        while entry[0] != NULL:
> +            i = (i << 2) + i + perturb + 1
> +            entry = &table[i & mask]
> +            perturb >>= PERTURB_SHIFT

This is code borrowed from Python's dictobject.c/setobject.c, I think?  I
think you need to:

 - point to the Python source, which explains this logic, and
 - maybe adjust the copyright & licence declaration.

I'm no expert, but maybe we shouldn't be claiming sole ownership of this
code... maybe put the key parts like this that are fairly directly taken
from Python (assuming I'm right about that!) into a separate file, so we
keep it clear which code is solely ours, and which is mixed.  Possibly
that's just this function and the _lookup function?
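FWIW, here's the probe sequence in pure Python as I understand it from
CPython's dictobject.c, just a sketch to illustrate what the loop is doing
(the function name and 'limit' argument are mine):

    PERTURB_SHIFT = 5

    def probe_order(key_hash, mask, limit=8):
        """Yield the first `limit` slot indices probed for `key_hash`.

        Assumes a non-negative hash for simplicity; the C code keeps
        'perturb' in an unsigned size_t.
        """
        i = key_hash & mask
        perturb = key_hash
        for _ in range(limit):
            yield i & mask
            i = (i << 2) + i + perturb + 1   # i.e. i = i*5 + perturb + 1
            perturb >>= PERTURB_SHIFT

    # e.g. the probe order for hash 12345 in a table of 8 slots:
    print(list(probe_order(12345, mask=7)))

The point of 'perturb' is that progressively higher bits of the hash get
mixed into the index; once it has decayed to zero, the recurrence
i = 5*i + 1 (mod 2**k) cycles through every slot, so a probe is guaranteed
to find an empty one eventually.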
> +    cdef Py_ssize_t _resize(self, Py_ssize_t min_used) except -1:
[...]
> +        # We rolled over our signed size field
> +        if new_size <= 0:
> +            raise MemoryError()

It's good to see code being so pedantic :)

> +    def discard(self, key):
> +        """Remove key from the dict, whether it exists or not.
> +
> +        :return: 0 if the item did not exist, 1 if it did
> +        """
> +        return self._discard(key)

It'd be slightly nicer to return Py_False/Py_True.  I guess you can easily
do that by doing return bool(self._discard(key))?  Or maybe with
PyBool_FromLong...

> +cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
> +    """Find the slot where 'key' would fit.
> +
> +    This is the same as a dicts 'lookup' function.
> +
> +    :param key: An object we are looking up
> +    :param hash: The hash for key
> +    :return: The location in self.table where key should be put
> +        should never be NULL, but may reference a NULL (PyObject*)
> +    """

I don't quite follow the “should never be NULL, but may reference a NULL”
part of that docstring.

[...]
> +    if cur[0] == NULL:
> +        # Found a blank spot, or found the exact key
> +        return cur
> +    if cur[0] == py_key:
> +        return cur
> +    if cur[0] == _dummy:
> +        free_slot = cur
> +    else:
> +        if _is_equal(py_key, key_hash, cur[0]):
> +            # Both py_key and cur[0] belong in this slot, return it
> +            return cur
> +        free_slot = NULL
> +    # size_t is unsigned, hash is signed...
> +    perturb = key_hash
> +    while True:
> +        i = (i << 2) + i + perturb + 1
> +        cur = &table[i & mask]
> +        if cur[0] == NULL: # Found an empty spot
> +            if free_slot: # Did we find a _dummy earlier?
> +                return free_slot
> +            else:
> +                return cur
> +        if (cur[0] == py_key # exact match
> +            or _is_equal(py_key, key_hash, cur[0])): # Equivalent match
> +            return cur
> +        if (cur[0] == _dummy and free_slot == NULL):
> +            free_slot = cur
> +        perturb >>= PERTURB_SHIFT
> +    raise AssertionError('should never get here')

Hmm, this appears to be essentially:

    while True:
        <probe; every branch either returns or continues>
    raise AssertionError('should never get here')

and an unconditional 'while True:' can only be left via those returns, so
the raise is unreachable.  Why can't it be just:

    while True:
        <probe>

?
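For what it's worth, here's roughly that lookup in pure Python with the
first probe folded into the loop.  Purely a sketch: the names lookup_slot
and _DUMMY and the list-based table are mine, and _is_equal's
hash-comparison shortcut is simplified to a plain ==:

    PERTURB_SHIFT = 5
    _DUMMY = object()   # stand-in for the C-level _dummy tombstone

    def lookup_slot(table, key):
        """Return the index where `key` lives, or where it could go.

        `table` is a list whose length is a power of two; empty slots
        hold None and discarded slots hold _DUMMY, mirroring the C table
        of PyObject pointers.
        """
        mask = len(table) - 1
        key_hash = hash(key)
        i = key_hash
        perturb = key_hash & 0xFFFFFFFF   # the C code keeps perturb unsigned
        free_slot = None
        while True:
            idx = i & mask
            slot = table[idx]
            if slot is None:                  # empty: key is absent
                return free_slot if free_slot is not None else idx
            if slot is _DUMMY:                # tombstone: remember the first
                if free_slot is None:
                    free_slot = idx
            elif slot is key or slot == key:  # exact or equivalent match
                return idx
            i = (i << 2) + i + perturb + 1
            perturb >>= PERTURB_SHIFT

(The free_slot handling is what _dummy buys you: a discard followed by a
re-add can reuse the tombstone instead of growing the table.)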
> +# TODO: this should probably have direct tests, since it isn't used by __iter__
> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key):

I agree with that TODO ;)  Although getting rid of the duplication between
this and _SimpleSet_iterator.__next__ would be even nicer.

[...]
> +cdef int SimpleSet_traverse(SimpleSet self, visitproc visit, void *arg):
> +    """This is an implementation of 'tp_traverse' that hits the whole table.
> +
> +    Cython/Pyrex don't seem to let you define a tp_traverse, and they only
> +    define one for you if you have an 'object' attribute. Since they don't
> +    support C arrays of objects, we access the PyObject * directly.
> +    """
> +    cdef Py_ssize_t pos
> +    cdef PyObject *next_key
> +    cdef int ret
> +
> +    pos = 0
> +    while SimpleSet_Next(self, &pos, &next_key):
> +        ret = visit(next_key, arg)
> +        if ret:
> +            return ret
> +
> +    return 0;

You don't need the semi-colon ;)

I wonder if there's any risk that an object referenced by a SimpleSet might
somehow mutate that SimpleSet during this traverse?

> === added file 'bzrlib/tests/test__simple_set.py'
[...]
> +    def assertIn(self, obj, container):
> +        self.assertTrue(obj in container,
> +            '%s not found in %s' % (obj, container))
> +
> +    def assertNotIn(self, obj, container):
> +        self.assertTrue(obj not in container,
> +            'We found %s in %s' % (obj, container))

I was a bit surprised that we didn't seem to already have these!

> +    def assertRefcount(self, count, obj):
> +        """Assert that the refcount for obj is what we expect.
> +
> +        Note that this automatically adjusts for the fact that calling
> +        assertRefcount actually creates a new pointer, as does calling
> +        sys.getrefcount. So pass the expected value *before* the call.
> +        """
> +        # I'm not sure why the offset is 3, but I've check that in the caller,
> +        # an offset of 1 works, which is expected. Not sure why assertRefcount
> +        # is incrementing/decrementing 2 times
> +        self.assertEqual(count, sys.getrefcount(obj)-3)

Eep.  This is going to be pretty fragile across CPython versions, let alone
non-CPython Pythons...  Can you perhaps use weakrefs instead?  They're
enough to let you observe object lifetimes (either via a callback, or just
by seeing that the weakref cannot be resolved), although for portability
you still need to call gc.collect() before you can reliably know whether
the object is gone or not...  (There's a sketch of what I mean at the end
of this mail.)

I guess we can worry about disabling the tests that need this on other
Pythons if/when we run on them... (and of course we may not care about this
code much on non-CPython!).

> +    def test__lookup(self):
> +        # The tuple hash function is rather good at entropy. For all integers
> +        # 0=>1023, hash((i,)) & 1023 maps to a unique output, and hash((i,j))
> +        # maps to all 1024 fields evenly.
> +        # However, hash((c,d))& 1023 for characters has an uneven distribution
> +        # of collisions, for example:
> +        #   ('a', 'a'), ('f', '4'), ('p', 'r'), ('q', '1'), ('F', 'T'),
> +        #   ('Q', 'Q'), ('V', 'd'), ('7', 'C')
> +        # all collide @ 643
> +        obj = self.module.SimpleSet()
> +        offset, val = obj._test_lookup(('a', 'a'))
> +        self.assertEqual(643, offset)
[...]

Hmm, again pretty fragile due to depending on Python internals.  Oh well.

-Andrew.
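P.S. To be concrete about the weakref idea, something like the sketch below
is what I have in mind.  (Untested, and the helper name and _Token class
are made up.  Note that tuples and strings can't be weakly referenced, so
the test keys would need to be something like _Token instances.)

    import gc
    import weakref

    class _Token(object):
        """A trivially weakref-able key object."""

    def assertCollected(test, ref):
        """Fail unless the object behind the weakref `ref` has been freed."""
        gc.collect()    # needed on Pythons that don't refcount
        test.assertEqual(None, ref())

    # Usage in a test would look something like:
    #     key = _Token()
    #     ref = weakref.ref(key)
    #     s = self.module.SimpleSet()
    #     s.add(key)
    #     del key
    #     self.assertNotEqual(None, ref())   # the set keeps the key alive
    #     s.discard(ref())
    #     assertCollected(self, ref)         # now it should be gone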