John A Meinel wrote:
[...]
> I would guess that doing
>
>   add = dedup.add
>   for bar in group:
>       bar = add(bar)
>
> Is going to perform the same as the dedup[bar] form. But both should do
> better than setdefault. (At a minimum, setdefault takes 2 parameters,
> and thus has to create a tuple, etc.)
>
> I don't know if the interpreter has further internal benefits to using
> the __getitem__ form, since it is a known API that has to conform in
> certain ways.

Right, I think you'd still see a bit of benefit in the __getitem__ form
because Python knows that that operator only takes one arg and so avoids
packing and unpacking an args tuple for it. (Like METH_O vs.
METH_VARARGS.) ISTR some part of the zope.interface C extension
intentionally abuses an obscure no-arg operator (maybe __pos__?) as
that's the fastest way to invoke C code from Python.

> The big concern is that "dedup[bar]" can mutate 'dedup' and that is
> potentially unexpected.

Agreed. For readability my preference is .add, but obviously
sufficiently large performance benefits may override that. I don't
*think* the boost from using __getitem__ instead would be that much
greater for this use, but I haven't measured, so I may be very wrong :)

[...]
> How's this:
>
>     # Data structure definition:

Great!

> >> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
> >> +    cdef long other_hash
> >> +    cdef PyObject *res
> >> +
> >> +    if this == other:
> >> +        return 1
> >> +    other_hash = Py_TYPE(other).tp_hash(other)
> >
> > What happens if 'other' is not hashable?
>
> Other will always be something that is already held in the internal
> structure, and thus has been hashed. 'this' has also already been
> hashed as part of inserting, and thus we've also checked that it can
> be hashed.
>
> I can change this to PyObject_Hash() if you feel strongly. I was
> avoiding a function call overhead, though it probably doesn't have a
> huge impact on performance.

Well, badly-behaved objects might successfully give a hash value the
first time and then get an error later. I do strongly lean towards
paranoia in C code, so I think PyObject_Hash is probably safest. You
ought to be able to construct a test that defines such a badly-behaved
object to see just how bad the fallout is. If segfaults are possible
then definitely close that hole!

Also, PyObject_Hash is where Python implements the logic that an object
with no tp_hash in its type will use hash(id(obj)), so long as it
doesn't have tp_compare or tp_richcompare. I think this is pretty
common, e.g. the module type has no tp_hash. And given that you use the
'hash' global elsewhere in this module there's definitely some chance
for confusion here.

But for the case you're optimising for, PyObject_Hash will just call
the object's tp_hash immediately and return the result, so it should be
a minimal penalty, just a C function call overhead.

The thing I was really looking for, though, was checking the result
value of tp_hash/PyObject_Hash -- if it's -1 then there's an error to
be dealt with.

> >> +    if other_hash != this_hash:
> >> +        return 0
> >> +    res = Py_TYPE(this).tp_richcompare(this, other, Py_EQ)
> >
> > Similarly, what if one of the richcompare calls raises an exception?
>
> I'll admit that I took a few shortcuts, as I was originally thinking
> that _table would be:
>
>   StaticTuple **_table
>
> rather than a generic object. So I can...
>
> 1) Define _is_equal to 'except -1' and have Pyrex do exception
> checking on every query. And then I can check 'if res == NULL:
> return -1'.

I think this is a must, and should be pretty cheap.
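To make the shape concrete, here is a rough C sketch of _is_equal with
both error paths checked -- treat it as illustrative only (it assumes
the type actually fills in tp_richcompare, as the patch already does,
and the names just mirror the patch):

  /* Returns 1 if equal, 0 if not, -1 with an exception pending --
   * the same convention Pyrex's 'except -1' expects. */
  static int
  _is_equal(PyObject *this, long this_hash, PyObject *other)
  {
      long other_hash;
      PyObject *res;
      int equal;

      if (this == other)
          return 1;
      other_hash = PyObject_Hash(other);
      if (other_hash == -1)
          return -1;            /* a badly-behaved __hash__ raised */
      if (other_hash != this_hash)
          return 0;
      /* Assumes Py_TYPE(this)->tp_richcompare is non-NULL. */
      res = Py_TYPE(this)->tp_richcompare(this, other, Py_EQ);
      if (res == NULL)
          return -1;            /* the comparison raised */
      equal = (res == Py_True);
      Py_DECREF(res);
      return equal;
  }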
> 2) Arguably doing "res == Py_True" isn't strictly correct either, and
> I should be using PyObject_IsTrue() here. However, the more generic we
> make this code, the slower it performs, and we have to ask whether it
> is worth being generic given that it is meant to be used with specific
> types...

Well, if you want to restrict it to particular types, make the code
reject anything else from being added. Otherwise someone will use this
code in unexpected ways sooner or later :)

Again, PyObject_IsTrue tries == Py_True as the very first thing, so the
cost should be small. If you're worried, a macro along the lines of:

  #define PY_IS_TRUE(o) ((o == Py_True) || PyObject_IsTrue(o))

might make us both happy.

> 3) I would consider going so far as to restrict this class to only
> supporting one type of object. (First object inserted sets the type,
> or it is set via the constructor.)

Ah, I see you're thinking along the same lines as me.

> I'd like to avoid making the common case slow because of very unlikely
> potential edge cases. But I'm willing to discuss where an appropriate
> tradeoff is.

Sure. I do think we'll be tempted to use things like this for more than
your original purpose if it works as well as it sounds like it will...
so if it's cheap to make it more widely applicable, or cheap to make it
only work with objects it's intended to work with, then we should do
that. If nothing else, our tests have a tendency to substitute regular
objects with strange test doubles...

[...]
> The only bit that is close to copyrightable is:
> >> +    i = (i << 2) + i + perturb + 1
> >> +    entry = &table[i & mask]
> >> +    perturb >>= PERTURB_SHIFT
> Since it is an explicit combination of parameters. I'm happy to just
> change it to something like:
> http://en.wikipedia.org/wiki/Quadratic_probing
>
> Which is: trivial to implement, just a different set of performance
> tradeoffs (you gain locality, at an increased chance of collisions),
> and certainly a public-domain solution that would avoid any copyright
> issues.

I'd lean towards that then, simply because I'm not even close to
qualified to say what's safe to borrow. If the performance is about the
same then I don't see any downside. (Although, for a table of pointers,
locality is probably less interesting, but it may still help subsequent
probes when there is a collision...)
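For reference, quadratic probing over a power-of-two table really is
trivial. A sketch, where only 'table' and 'mask' mirror the patch's
names and matches() is a hypothetical comparison helper:

  /* Triangular offsets (1, 3, 6, 10, ...) visit every slot exactly
   * once before repeating when the table size is a power of two, so
   * the loop always terminates while the table has a free slot. */
  size_t i = hash & mask;
  size_t step;
  for (step = 1; table[i] != NULL; step++) {
      if (matches(table[i], key))
          break;                    /* found the key */
      i = (i + step) & mask;        /* next probe offset */
  }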
[...]
> > It'd be slightly nicer to return Py_False/Py_True. I guess you can
> > easily do that by doing return bool(self._discard(key))? Or maybe
> > with PyBool_FromLong...
>
> Originally I was using:
>
>   cpdef bint discard(self, key):
>       ...
>
> Which is Cython code that defines both a Python version and a C
> version, and casts integers into Booleans. I can certainly just do:

That's a nice feature! A good way to keep the C and Python interfaces
harmonised.

>   if self._discard(key):
>       return True
>   return False

I think you may as well; I doubt it will make a difference to
performance and it will make the interface feel just that tiny bit
nicer.

> >> +cdef PyObject **_lookup(SimpleSet self, object key) except NULL:
> >> +    """Find the slot where 'key' would fit.
> >> +
> >> +    This is the same as a dict's 'lookup' function.
> >> +
> >> +    :param key: An object we are looking up
> >> +    :param hash: The hash for key
> >> +    :return: The location in self.table where key should be put;
> >> +        should never be NULL, but may reference a NULL (PyObject *)
> >> +    """
> >
> > I don't quite follow the "should never be NULL, but may reference a
> > NULL" part of that docstring.
>
> I shouldn't be using the term 'dict' here either.
>
> Anyway, I'm returning a **. I'll try to clarify. But in essence:
>
>   PyObject **result = _lookup(...);
>
>   assert(result != NULL)
>   if (*result) == NULL: # nothing is stored here

Oh, I see, because it's returning a pointer to a pointer. I think I'd
just say "The location in self.table where key should be put. Will not
be NULL." I think it should go without saying that a particular slot
may or may not be NULL. If you still want to mention it I'd say "(but
the slot itself may point to NULL)".

> > Hmm, this appears to be essentially:
> >
> >   [...]
> >
> > Why can't it be:
> >
> >   while True:
> >       [...]
>
> So, there are a few subtle differences, but I'm not sure if it is
> worth the code duplication. Namely:
>
> 1) The first lookup is going to be the right location 90+% of the time
> (if things are sized correctly, your hash is working well, etc.), so
> you want to avoid any overhead that you can.
>
> 2) Part of that is checking whether free_slot is NULL or not.
>
> 3) The loop code changes the order of checking for _dummy with
> checking _is_equal, because the claim is that _dummy is rare enough to
> offset the speed of the check with the speed of the comparison. I
> guess I'm doubtful, too.

On the other hand, maybe more compact code will fit better in the
cache ;)

If the benefit is doubtful I'd go for simpler and shorter, myself. If
you'd rather not bother, then at least put a comment there saying
you've special-cased the first iteration for performance because it
should be all that's needed 90+% of the time.

> > You don't need the semi-colon ;)
> >
> > I wonder if there's any risk that an object referenced by a
> > SimpleSet might somehow mutate that SimpleSet during this traverse?
>
> This is the same basic implementation that is used by dicts and sets.
> Namely, tp_traverse is used by the garbage collector, so 'doing stuff'
> during it is supposed to be forbidden (I assume).

Cool, that's what I was hoping.
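For the record, the dict/set-style traverse pattern is tiny: visit
every live entry and skip the holes. A sketch only -- the SimpleSet
struct layout and the '_dummy' sentinel name here are assumed, not
copied from your patch:

  static int
  SimpleSet_traverse(SimpleSet *self, visitproc visit, void *arg)
  {
      Py_ssize_t i;
      for (i = 0; i <= (Py_ssize_t)self->mask; i++) {
          PyObject *entry = self->table[i];
          /* skip empty slots and the dummy (deleted-entry) marker */
          if (entry != NULL && entry != _dummy)
              Py_VISIT(entry);  /* returns early if visit() fails */
      }
      return 0;
  }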
> >> +    def assertRefcount(self, count, obj):
> >> +        """Assert that the refcount for obj is what we expect.
> >> +
> >> +        Note that this automatically adjusts for the fact that
> >> +        calling assertRefcount actually creates a new pointer, as
> >> +        does calling sys.getrefcount. So pass the expected value
> >> +        *before* the call.
> >> +        """
> >> +        # I'm not sure why the offset is 3, but I've checked that
> >> +        # in the caller an offset of 1 works, which is expected.
> >> +        # Not sure why assertRefcount is incrementing/decrementing
> >> +        # 2 times.
> >> +        self.assertEqual(count, sys.getrefcount(obj) - 3)
> >
> > Eep. This is going to be pretty fragile across CPython versions, let
> > alone non-CPython Pythons...
>
> Do you really think that refcounts are going to be fragile across
> CPython versions?

Well, probably not, but it wouldn't really surprise me either. Another
possible issue: this may break when run inside a debugger.

> > Can you perhaps use weakrefs instead? They're enough to let you
> > observe object lifetimes (either via a callback, or just seeing that
> > the weakref cannot be resolved), although for portability you still
> > need to call gc.collect() before you can reliably know if the object
> > is gone or not...
>
> tuples can't have weakrefs. I really *do* want to assert that my
> refcounts are correct. And Pyrex code is *not* portable to anything
> but CPython.

Well, there are competing implementations of CPython these days, e.g.
Unladen Swallow, which is experimenting with garbage collection...

> > I guess we can worry about disabling the tests that need this on
> > other Pythons if/when we run on them... (and of course we may not
> > care about this code much on non-CPython!).
>
> Exactly. You can't write a C extension for Jython or IronPython...

There's work underway to make it possible to use CPython extensions in
IronPython: ...

Anyway, I'm ok with asserting things about refcounts, just so long as
it's kept to a minimum :)

[...]
> > Hmm, again pretty fragile due to depending on Python internals. Oh
> > well.
> >
> > -Andrew.
>
> If you feel it would be less fragile, I can change that to use the
> integers corresponding to hash('a') or hash(('a', 'a')). I really
> wanted to make sure things were working with StaticTuple, which uses
> the hash algorithm from tuple itself.
>
> This code originally used all StaticTuple instances, until I
> refactored it for review. And as I mentioned, it originally wasn't
> going to allow arbitrary objects.
>
> However, since it now *does*, I could just use a bunch of ints. That
> still depends on hash() details, which are Python-implementation
> specific... I could write my own class with a __hash__ that returns
> whatever I want, but that seems a bit YAGNI.
>
> I'll simplify it, though, if it makes you feel more comfortable.

I'm ok with it as is. I'm comfortable that we can easily enough change
it later if it does become a problem, and if it never does then even
better ;)

-Andrew.