Andrew Bennetts wrote:
...
>> + 1) Don't have all of the .intersection, .difference, etc functions
>> + 2) Do return the object from the set via queries
>> +    eg. SimpleSet.add(key) => saved_key and SimpleSet[key] => saved_key
>> +"""
>
> I feel a bit unsure about a type that has both add and __getitem__/__delitem__.
> It's a slightly unusual mix of dict-like and set-like APIs.  I think it's fine
> after some thinking about it, I'm just noting that it seems a bit funny at
> first.  Although maybe .remove would be more set-like than .__delitem__?

So at the moment, the only APIs I need for interning are add() and
discard().  I could certainly remove __getitem__ and __delitem__ if you
are more comfortable with that.  I want to have an API that fits what we
need/want to use.  As such, it will probably be driven by actual use
cases.  And so far, there is only one of those...

I considered changing the semantics so that "__getitem__" == "add()".
The main reason is that it avoids the getattr() overhead: __getitem__ is
a slot in the type struct (mp_subscript / sq_item), while .add()
requires an attribute lookup.  Which basically means that in an
interning loop you would have:

  for bar in group:
      bar = dedup[bar]

rather than

      bar = dedup.add(bar)

or the old form

      bar = dict.setdefault(bar, bar)

I would guess that doing

  add = dedup.add
  for bar in group:
      bar = add(bar)

is going to perform the same as the dedup[bar] form.  But both should do
better than setdefault.  (At a minimum, setdefault takes 2 parameters,
and thus has to create a tuple, etc.)  I don't know whether the
interpreter has further internal benefits to using the __getitem__ form,
since it is a known API that has to conform in certain ways.  The big
concern is that "dedup[bar]" can mutate 'dedup', and that is potentially
unexpected.

>
>> +cdef api object SimpleSet_Add(object self, object key)
>> +cdef api SimpleSet SimpleSet_New()
>> +cdef api object SimpleSet_Add(object self, object key)
>
> That appears to be a duplicate declaration of SimpleSet_Add.

Thanks.

...
> [...]
>> +cdef object _dummy_obj
>> +cdef PyObject *_dummy
>> +_dummy_obj = object()
>> +_dummy = _dummy_obj
>
> It's not very clear what _dummy is used for.  I guess what's missing is
> some text giving an overview of the data structure, although I suppose
> that's what you mean to convey by the docstrings saying it is similar
> to the builtin dict/set types.
>
> Anyway, it appears _dummy is used to avoid resizing/compacting the
> table for every single discard.  [Part of why it would be nice to have
> some text describing the data structure is so that there's some clear
> terminology to use... the code pretty clearly talks about “tables” and
> “slots”, which seem fairly clear, but then what's an “entry”?]

How's this:

# Data structure definition:
# This is a basic hash table using open addressing.
# http://en.wikipedia.org/wiki/Open_addressing
# Basically that means we keep an array of pointers to Python objects
# (called a table).  Each location in the array is called a 'slot'.
#
# An empty slot holds a NULL pointer, a slot where there was an item
# which was then deleted will hold a pointer to _dummy, and a filled
# slot points at the actual object which fills that slot.
#
# The table is always a power of two, and the default location where an
# object is inserted is at hash(object) & (table_size - 1).
#
# If there is a collision, then we search for another location.  The
# specific algorithm is in _lookup.  We search until we:
#    find the object
#    find an equivalent object (by tp_richcompare(obj1, obj2, Py_EQ))
#    find a NULL slot
#
# When an object is deleted, we set its slot to _dummy.  This way we
# don't have to track whether there was a collision and find the
# corresponding keys.  (The collision resolution algorithm makes that
# nearly impossible anyway, because it depends on the upper bits of the
# hash.)
# The main effect of this is that if we find _dummy, then we can insert
# an object there, but we have to keep searching until we find NULL to
# know that the object is not present elsewhere.
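
To make the terminology concrete, here is a rough pure-Python model of
that scheme (illustrative only: it uses simple linear probing instead of
the perturb-based probe the real code uses, it never resizes, and all
the names are made up):

  _dummy = object()   # sentinel for a slot whose entry was deleted

  class _TableSketch(object):

      def __init__(self):
          self._table = [None] * 8            # 8 'slots' (None == NULL)
          self._mask = len(self._table) - 1   # size is a power of two

      def _lookup(self, key):
          """Return the slot index holding key, or where it would go."""
          i = hash(key) & self._mask
          free_slot = None
          while True:
              cur = self._table[i]
              if cur is None:
                  # An empty slot ends the search; prefer reusing an
                  # earlier _dummy slot if we passed one.
                  return i if free_slot is None else free_slot
              if cur is not _dummy and (cur is key or cur == key):
                  return i                    # exact or equivalent match
              if cur is _dummy and free_slot is None:
                  free_slot = i               # remember, keep searching
              i = (i + 1) & self._mask        # simplified linear probe

      def add(self, key):
          i = self._lookup(key)
          cur = self._table[i]
          if cur is None or cur is _dummy:
              self._table[i] = key
              return key
          return cur                          # the interned copy

      def discard(self, key):
          i = self._lookup(key)
          if self._table[i] is None or self._table[i] is _dummy:
              return 0
          self._table[i] = _dummy             # don't compact, just mark
          return 1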
>
>> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
>> +    cdef long other_hash
>> +    cdef PyObject *res
>> +
>> +    if this == other:
>> +        return 1
>> +    other_hash = Py_TYPE(other).tp_hash(other)
>
> What happens if 'other' is not hashable?

'other' will always be something that is already held in the internal
structure, and thus has already been hashed.  'this' has also already
been hashed as part of inserting, and thus we've also checked that it
can be hashed.  I can change this to PyObject_Hash() if you feel
strongly.  I was avoiding a function call overhead, though it probably
doesn't have a huge impact on performance.

>
>> +    if other_hash != this_hash:
>> +        return 0
>> +    res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
>
> Similarly, what if one of the richcompare calls raises an exception?

I'll admit that I took a few shortcuts, as I was originally thinking
that _table would be:

  StaticTuple **_table

rather than a generic object.  So I can...

1) Define _is_equal with 'except -1' and have Pyrex do exception
checking on every query.  And then I can check 'if res == NULL:
return -1'.

2) Arguably doing "res == Py_True" isn't strictly correct either, and I
should be using PyObject_IsTrue() here.  However, the more generic we
make this code, the slower it performs, and we have to ask whether it is
worth being generic, given that it is meant to be used with specific
types...

3) I would consider going so far as to restrict this class to only
supporting one type of object.  (The first object inserted sets the
type, or it is set via the constructor.)

I'd like to avoid making the common case slow because of very unlikely
potential edge cases.  But I'm willing to discuss where an appropriate
tradeoff is.
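
For what it's worth, the semantics being discussed are roughly the
following, in plain Python terms (a sketch, not the Pyrex code; the real
version goes through tp_hash/tp_richcompare directly):

  def _is_equal(this, this_hash, other):
      # Identity is the cheapest check, and the common case when
      # interning the same object twice.
      if this is other:
          return True
      # Equal objects must hash equal, so a mismatch is a cheap "no".
      # hash() is where an unhashable 'other' would raise, and == is
      # where a misbehaving richcompare would raise -- the two error
      # paths being asked about.
      if hash(other) != this_hash:
          return False
      return this == other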
>
>> +cdef public api class SimpleSet [object SimpleSetObject, type SimpleSet_Type]:
> [...]
>> +    cdef int _insert_clean(self, PyObject *key) except -1:
>> +        """Insert a key into self.table.
>> +
>> +        This is only meant to be used during times like '_resize',
>> +        as it makes a lot of assumptions about keys not already being
>> +        present, and there being no dummy entries.
>> +        """
>> +        cdef size_t i, perturb, mask
>> +        cdef long the_hash
>> +        cdef PyObject **table, **entry
>> +
>> +        mask = self._mask
>> +        table = self._table
>> +
>> +        the_hash = Py_TYPE(key).tp_hash(key)
>> +        i = the_hash & mask
>> +        entry = &table[i]
>> +        perturb = the_hash
>> +        # Because we know that we made all items unique before, we can
>> +        # just iterate as long as the target location is not empty, we
>> +        # don't have to do any comparison, etc.
>> +        while entry[0] != NULL:
>> +            i = (i << 2) + i + perturb + 1
>> +            entry = &table[i & mask]
>> +            perturb >>= PERTURB_SHIFT
>
> This is code borrowed from Python's dictobject.c/setobject.c, I think?
> I think you need to:
>
>  - point to the Python source, which explains this logic
>  - and maybe adjust the copyright & licence declaration.  I'm no
>    expert, but maybe we shouldn't be claiming sole ownership of this
>    code... maybe put the key parts like this that are fairly directly
>    taken from Python (assuming I'm right about that!) into a separate
>    file, so we keep it clear which code is solely ours, and which is
>    mixed.  Possibly that's just this function and the _lookup function?

The only bit that is close to copyrightable is:

>> +            i = (i << 2) + i + perturb + 1
>> +            entry = &table[i & mask]
>> +            perturb >>= PERTURB_SHIFT

since it is an explicit combination of parameters.  I'm happy to just
change it to something like:

  http://en.wikipedia.org/wiki/Quadratic_probing

which is trivial to implement, has just a different set of performance
tradeoffs (you gain locality at an increased chance of collisions), and
is certainly a public domain solution that would avoid any copyright
issues.
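
To make that comparison concrete, the two probe sequences look roughly
like this in Python (a sketch: in the C code, i and perturb are unsigned
size_t values that wrap, so negative hashes behave differently here):

  PERTURB_SHIFT = 5

  def perturb_probe(key_hash, mask):
      """The dictobject.c-style sequence quoted above."""
      i = key_hash & mask
      perturb = key_hash
      while True:
          yield i & mask
          i = (i << 2) + i + perturb + 1   # i is deliberately not masked
          perturb >>= PERTURB_SHIFT

  def quadratic_probe(key_hash, mask):
      """Quadratic probing with triangular steps: h, h+1, h+3, h+6, ...

      With a power-of-two table this still visits every slot, and
      consecutive probes stay close together (better cache locality, at
      the cost of more clustering).
      """
      i = key_hash & mask
      step = 0
      while True:
          yield i
          step += 1
          i = (i + step) & mask

Comparing, say, the first five values of each generator for the same
hash shows the locality difference immediately.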
>
>> +    cdef Py_ssize_t _resize(self, Py_ssize_t min_used) except -1:
> [...]
>> +        # We rolled over our signed size field
>> +        if new_size <= 0:
>> +            raise MemoryError()
>
> It's good to see code being so pedantic :)

new_size is a signed integer, and many things allocate using an unsigned
integer, so it *can* happen.  But yeah, I didn't think of it myself.

>
>> +    def discard(self, key):
>> +        """Remove key from the dict, whether it exists or not.
>> +
>> +        :return: 0 if the item did not exist, 1 if it did
>> +        """
>> +        return self._discard(key)
>
> It'd be slightly nicer to return Py_False/Py_True.  I guess you can
> easily do that by doing return bool(self._discard(key))?  Or maybe with
> PyBool_FromLong...

Originally I was using:

  cpdef bint discard(self, key):
      ...

which is Cython code that defines both a Python version and a C version,
and casts integers into booleans.  I can certainly just do:

  if self._discard(key):
      return True
  return False

>
>> +cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
>> +    """Find the slot where 'key' would fit.
>> +
>> +    This is the same as a dict's 'lookup' function.
>> +
>> +    :param key: An object we are looking up
>> +    :param hash: The hash for key
>> +    :return: The location in self.table where key should be put
>> +        should never be NULL, but may reference a NULL (PyObject*)
>> +    """
>
> I don't quite follow the “should never be NULL, but may reference a
> NULL” part of that docstring.

I shouldn't be using the term 'dict' here either.  Anyway, I'm returning
a **, and I'll try to clarify.  But in essence:

  PyObject **result = _lookup(...);
  assert(result != NULL);
  if (*result == NULL) {
      /* nothing is stored here */
  }

> [...]
>> +    if cur[0] == NULL:
>> +        # Found a blank spot, or found the exact key
>> +        return cur
>> +    if cur[0] == py_key:
>> +        return cur
>> +    if cur[0] == _dummy:
>> +        free_slot = cur
>> +    else:
>> +        if _is_equal(py_key, key_hash, cur[0]):
>> +            # Both py_key and cur[0] belong in this slot, return it
>> +            return cur
>> +        free_slot = NULL
>> +    # size_t is unsigned, hash is signed...
>> +    perturb = key_hash
>> +    while True:
>> +        i = (i << 2) + i + perturb + 1
>> +        cur = &table[i & mask]
>> +        if cur[0] == NULL: # Found an empty spot
>> +            if free_slot: # Did we find a _dummy earlier?
>> +                return free_slot
>> +            else:
>> +                return cur
>> +        if (cur[0] == py_key # exact match
>> +            or _is_equal(py_key, key_hash, cur[0])): # Equivalent match
>> +            return cur
>> +        if (cur[0] == _dummy and free_slot == NULL):
>> +            free_slot = cur
>> +        perturb >>= PERTURB_SHIFT
>> +    raise AssertionError('should never get here')
>
> Hmm, this appears to be essentially:
>
>   [...]
>   while True:
>       [...]
>
> Why can't it be:
>
>   while True:
>       [...]
>
> ?

So, there are a few subtle differences, but I'm not sure if it is worth
the code duplication.  Namely:

1) The first lookup is going to be the right location 90+% of the time
(if things are sized correctly, your hash is working well, etc.), so you
want to avoid any overhead that you can.

2) Part of that overhead is checking whether free_slot is NULL or not.

3) The loop code swaps the order of checking for _dummy and checking
_is_equal, because the claim is that _dummy is rare enough to offset the
cost of the check against the cost of the comparison.  I guess I'm
doubtful, too.

>
>> +# TODO: this should probably have direct tests, since it isn't used by __iter__
>> +cdef api int SimpleSet_Next(object self, Py_ssize_t *pos, PyObject **key):
>
> I agree with that TODO ;)
>
> Although getting rid of the duplication between this and
> _SimpleSet_iterator.__next__ would be even nicer.

It costs a function call per item, but __next__ is already a Python
function call, so I'll just do it.  We don't really list(set) often
anyway.  (Current use cases *never* do so...)

...
> You don't need the semi-colon ;)
>
> I wonder if there's any risk that an object referenced by a SimpleSet
> might somehow mutate that SimpleSet during this traverse?

This is the same basic implementation that is used by dicts and sets.
Namely, tp_traverse is used by the garbage collector, so 'doing stuff'
during a traverse is supposed to be forbidden (I assume).

>
>> +    def assertRefcount(self, count, obj):
>> +        """Assert that the refcount for obj is what we expect.
>> +
>> +        Note that this automatically adjusts for the fact that calling
>> +        assertRefcount actually creates a new pointer, as does calling
>> +        sys.getrefcount.  So pass the expected value *before* the call.
>> +        """
>> +        # I'm not sure why the offset is 3, but I've checked that in
>> +        # the caller, an offset of 1 works, which is expected.  Not
>> +        # sure why assertRefcount is incrementing/decrementing 2 times
>> +        self.assertEqual(count, sys.getrefcount(obj)-3)
>
> Eep.  This is going to be pretty fragile across CPython versions, let
> alone non-CPython Pythons...

Do you really think that refcounts are going to be fragile across
CPython versions?

>
> Can you perhaps use weakrefs instead?  They're enough to let you
> observe object lifetimes (either via a callback, or just seeing that
> the weakref cannot be resolved), although for portability you still
> need to call gc.collect() before you can reliably know if the object is
> gone or not...

Tuples can't have weakrefs.  I really *do* want to assert that my
refcounts are correct.  And Pyrex code is *not* portable to anything but
CPython.

>
> I guess we can worry about disabling the tests that need this on other
> Pythons if/when we run on them... (and of course we may not care about
> this code much on non-CPython!).

Exactly.  You can't write a C extension for Jython or IronPython...
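
For reference, the weakref approach would look something like this (a
sketch, which also demonstrates the limitation I mean, since plain
tuples can't be weakly referenced at all):

  import gc
  import weakref

  class Probe(object):
      """A minimal object that does support weak references."""

  obj = Probe()
  ref = weakref.ref(obj)
  del obj
  gc.collect()             # needed on non-refcounting implementations
  assert ref() is None     # observable: the object is really gone

  # ...but the keys we actually care about are tuple-like:
  try:
      weakref.ref(('a', 'a'))
  except TypeError:
      pass                 # tuples cannot be weakly referenced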
>
>> +    def test__lookup(self):
>> +        # The tuple hash function is rather good at entropy.  For all
>> +        # integers 0=>1023, hash((i,)) & 1023 maps to a unique output,
>> +        # and hash((i,j)) maps to all 1024 fields evenly.
>> +        # However, hash((c,d)) & 1023 for characters has an uneven
>> +        # distribution of collisions, for example:
>> +        #  ('a', 'a'), ('f', '4'), ('p', 'r'), ('q', '1'), ('F', 'T'),
>> +        #  ('Q', 'Q'), ('V', 'd'), ('7', 'C')
>> +        # all collide @ 643
>> +        obj = self.module.SimpleSet()
>> +        offset, val = obj._test_lookup(('a', 'a'))
>> +        self.assertEqual(643, offset)
> [...]
>
> Hmm, again pretty fragile due to depending on Python internals.  Oh
> well.
>
> -Andrew.

If you feel it would be less fragile, I can change that to use the
integers corresponding to hash('a') or hash(('a', 'a')).  I really
wanted to make sure things were working with StaticTuple, which uses the
hash algorithm from tuple itself.  This code originally used all
StaticTuple instances, until I refactored it for review.  And as I
mentioned, it originally wasn't going to allow arbitrary objects.
However, since it now *does*, I could just use a bunch of ints.  It
still depends on hash() details, which are Python implementation
specific...  I could write my own class whose __hash__ returns exactly
the value I want, but it seems a bit YAGNI.  I'll simplify it, though,
if it makes you feel more comfortable.

John =:->