Code review comment for lp:~jameinel/bzr/2.1-simple-set

Revision history for this message
John A Meinel (jameinel) wrote :

Andrew Bennetts wrote:
> John A Meinel wrote:
> [...]
>> I would guess that doing
>>
>> add = dedup.add
>> for bar in group:
>> bar = add(bar)
>>
>> Is going to perform the same as the dedup[bar] form. But both should do
>> better than setdefault. (At a minimum, setdefault takes 2 parameters,
>> and thus has to create a tuple, etc.)
>>
>> I don't know if the interpreter has further internal benefits to using
>> the __getitem__ form, since it is a known api that has to conform in
>> certain ways.
>
> Right, I think you'd still see a bit of benefit in the __getitem__ form because
> Python knows that that operator only takes one arg and so avoids packing and
> unpacking an args tuple for it. (Like METH_O vs. METH_VARARGS.) ISTR some part
> of the zope.interface C extension intentionally abuses an obscure no-arg
> operator (maybe pos?) as that's the fastest way to invoke C code from Python.
>
>> The big concern is that "dedup[bar]" can mutate 'dedup' and that is
>> potentially unexpected.
>
> Agreed. For readability my preference is .add, but obviously sufficiently large
> performance benefits may override that. I don't *think* the boost from using
> __getitem__ instead would be that much greater for this use, but I haven't
> measured so I may be very wrong :)
>
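(To make the comparison concrete, here is a toy pure-Python stand-in for the dedup structure — not the real SimpleSet implementation, just a sketch of the semantics — showing both the `.add` and the `__getitem__` call styles being discussed:)

```python
class ToySimpleSet:
    """Toy stand-in for the interning set under discussion (not the
    real SimpleSet): it stores one canonical copy of each equal object
    and hands that copy back on later inserts."""

    def __init__(self):
        self._canonical = {}

    def add(self, obj):
        # Return the already-stored equal object, inserting if new.
        return self._canonical.setdefault(obj, obj)

    def __getitem__(self, obj):
        # Same semantics as add(); note that a *lookup* mutates the
        # set, which is the readability concern raised above.
        return self.add(obj)


dedup = ToySimpleSet()
add = dedup.add  # hoist the bound method, as in the loop quoted above
group = [tuple([1, 2]), tuple([3]), tuple([1, 2])]
canonical = [add(item) for item in group]
```

After the loop, `canonical[0] is canonical[2]` even though the two input tuples were distinct objects.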
> [...]
>> How's this:
>>
>> # Data structure definition:
>
> Great!
>
>>>> +cdef int _is_equal(PyObject *this, long this_hash, PyObject *other):
>>>> + cdef long other_hash
>>>> + cdef PyObject *res
>>>> +
>>>> + if this == other:
>>>> + return 1
>>>> + other_hash = Py_TYPE(other).tp_hash(other)
>>> What happens if 'other' is not hashable?
>> Other will always be something that is already held in the internal
>> structure, and thus has been hashed. 'this' has also already been hashed
>> as part of inserting, and thus we've also checked that it can be hashed.
>>
>> I can change this to PyObject_Hash() if you feel strongly. I was
>> avoiding a function call overhead, though it probably doesn't have a
>> huge impact on performance.
>
> Well, badly-behaved objects might successfully give a hash value the first time
> and then get an error later. I do strongly lean towards paranoia in C code, so
> I think PyObject_Hash is probably safest. You ought to be able to construct a
> test that defines such a badly-behaved object to see just how bad the fallout
> is. If segfaults are possible then definitely close that hole!

No segfaults, but the potential to leave an error in the pipe, which
causes random failures later, IIRC. (I forget which function it is
where '-1' is only *maybe* an error, so you have to check
PyErr_Occurred(), etc.)
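(A pure-Python sketch of the kind of badly-behaved object Andrew describes — it hashes fine the first time and raises after that — which is exactly why the return value of tp_hash/PyObject_Hash has to be checked:)

```python
class BadlyBehaved:
    """Hypothetical object whose hash succeeds the first time only."""

    def __init__(self):
        self._hashed = False

    def __hash__(self):
        if self._hashed:
            # At the C level this surfaces as tp_hash returning -1
            # with an exception set; ignoring that return value is
            # what leaves "an error in the pipe" described above.
            raise RuntimeError("hash is no longer available")
        self._hashed = True
        return 42
```

Inserting such an object works once, but any later operation that re-hashes it fails, so C code calling tp_hash must check for -1 and a pending exception.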

>
> Also, PyObject_Hash is where Python implements the logic that an object with no
> tp_hash in its type will use hash(id(obj)), so long as it doesn't have
> tp_compare or tp_richcompare. I think this is pretty common, e.g. the module
> type has no tp_hash. And given that you use the 'hash' global elsewhere in this
> module there's definitely some chance for confusion here.

So in my "_insert" I now have the assertion:
        if (Py_TYPE(py_key).tp_richcompare == NULL
            or Py_TYPE(py_key).tp_hash == NULL):
            raise TypeError('Types added to SimpleSet must implement'
                            ' both tp_richcompare and tp_hash')

So that I don't have to do those checks in the rest of the code.
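(A rough pure-Python analogue of that guard — the real check inspects the C type slots directly; here `__hash__` being None approximates a missing tp_hash slot, as for lists and dicts:)

```python
def check_supported(obj):
    """Approximate the tp_hash guard in pure Python.

    Unhashable types like list set __hash__ to None, which is the
    Python-level reflection of a missing tp_hash slot in C.
    """
    if type(obj).__hash__ is None:
        raise TypeError('Types added to SimpleSet must implement'
                        ' both tp_richcompare and tp_hash')
    return obj
```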

>
> But for the case you're optimising for PyObject_Hash will just call the object's
> tp_hash immediately and return the result, so it should be a minimal penalty,
> just a C function call overhead.
>
> The thing I was really looking for though was checking the result value of
> tp_hash/PyObject_Hash — if its -1 then there's an error to be dealt with.

Yeah, I did that.

...

>
> #define PY_IS_TRUE(o) ((o == Py_True) || PyObject_IsTrue(o))
>
> might make us both happy.

For now, I'll live with just PyObject_IsTrue(). I'm avoiding
RichCompare and such, because they do a lot more work (falling back to
default 3-way comparison, etc.), but for other stuff this seems fine.

>
>> 3) I would consider going so far as to restrict this class to only
>> supporting one type of object. (First object inserted sets the type,
>> or it is set via the constructor.)
>
> Ah, I see you're thinking along the same lines as me.
>
>> I'd like to avoid making the common case slow because of very unlikely
>> potential edge cases. But I'm willing to discuss where an appropriate
>> tradeoff is.
>
> Sure. I do think we'll be tempted to use things like this for more than your
> original purpose if it works as well as it sounds like it will... so if it's
> cheap to make it more widely applicable, or cheap to make it only work with
> objects it's intended to work with, then we should do that. If nothing else,
> our tests have a tendency to substitute regular objects with strange test
> doubles...
>

Though as always, we can implement things when it becomes obvious that
they are needed...

So for now, I'll make sure to propagate exceptions, but I won't restrict
it to a single class. (It also means you can't store PyInts in this,
since they don't implement richcompare, but I'm ok with that for now.)

> [...]
>> The only bit that is close to copyrightable is:
>>>> + i = (i << 2) + i + perturb + 1
>>>> + entry = &table[i & mask]
>>>> + perturb >>= PERTURB_SHIFT
>> Since it is an explicit combination of parameters. I'm happy to just
>> change it to something like:
>> http://en.wikipedia.org/wiki/Quadratic_probing
>>
>> Which is: trivial to implement, just a different set of performance
>> tradeoffs (you gain locality, at an increased chance for collisions),
>> and certainly a public domain solution that would avoid any copyright
>> issues.
>
> I'd lean towards that then, simply because I'm not even close to qualified to
> say what's safe to borrow. If the performance is about the same then I don't
> see any downside.
>
> (Although, for a table of pointers, locality is probably less interesting, but
> it may still help subsequent probes when there is a collision...)
>

So, reading the Python docs etc., my general feeling is that the
set/dict double-hashing code probably performs slightly better, mostly
because hash(obj) is generally not very close to an ideal hash:
strings are known to cluster, ints *definitely* cluster, etc.

However, the collision resolution only triggers *when there is a
collision*, so as long as that is unlikely (handled by resizing the hash
table when it starts to get full), then it doesn't really matter.

Locality probably helps a little when your table is 8MB, since then
you can be sure the table itself is not held in L2 cache. Walking to
Location+1 is going to be orders of magnitude cheaper than walking to
$RANDOM.

I'm slightly concerned because of the weakness of hash(), but only
slightly. I'd have to really see an effect to worry about it.

For my test case of "load all of launchpad's indices" I get:
  25.938s => 25.780s

Which is certainly within the potential noise of everything else. So
while I think there are definitely cases where it will perform worse
[e.g. range(2**10, 2**10 + 50000) will conflict on every key and use
the same probing chain], in the common cases it will do fine. If I were
to tweak it, it would be to find a way to bring in more of the upper
bits of the hash, which is what "+ perturb; perturb >>= PERTURB_SHIFT"
was doing.
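(The two probing strategies, sketched in pure Python for illustration. `perturb_probes` mirrors the CPython dict/set step `i = (i << 2) + i + perturb + 1`, i.e. `i*5 + perturb + 1`, which feeds the upper hash bits in through `perturb`; `quadratic_probes` is the public-domain alternative from the Wikipedia link:)

```python
PERTURB_SHIFT = 5

def perturb_probes(h, mask, n):
    """First n slots visited by the CPython-style probe, which mixes
    the upper bits of the hash in via `perturb`."""
    i = h & mask
    perturb = h
    slots = [i]
    for _ in range(n - 1):
        i = ((i << 2) + i + perturb + 1) & mask
        perturb >>= PERTURB_SHIFT
        slots.append(i)
    return slots

def quadratic_probes(h, mask, n):
    """First n slots for simple quadratic probing: offsets
    0, 1, 3, 6, 10, ... (triangular numbers) from the initial slot."""
    return [(h + k * (k + 1) // 2) & mask for k in range(n)]
```

For a mostly-empty table the first probe almost always hits, so the difference only shows up on collision chains: quadratic probing keeps the early probes near the initial slot (good for a table too big for L2), while the perturb scheme breaks up clusters of hashes that agree in their low bits.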

> [...]
>>> It'd be slightly nicer to return Py_False/Py_True. I guess you can easily do
>>> that by doing return bool(self._discard(key))? Or maybe with PyBool_FromLong...
>>>
>> Originally I was using:
>>
>> cpdef bint discard(self, key):
>> ...
>>
>> Which is Cython code that defines both a Python version and a C version,
>> and casts integers into Booleans. I can certainly just do:
>
> That's a nice feature! A good way to keep the C and Python interfaces
> harmonised.

It is. Cython really does have a decent number of improvements that
make it possible to write cleaner code that still performs well.
Unfortunately, there is no way to use them optionally. For example,
there is the 'inline' keyword, but the conditional compile (IF and
DEF) does not include a macro for the compiler version, so there is no
way to say:

IF CYTHON:
  cdef inline foo():
ELSE:
  def foo():

(The only defines that are pre-set are UNAME stuff.)

...

...

>> tuples can't have weakrefs. I really *do* want to assert that my
>> refcounts are correct. And Pyrex code is *not* portable to anything but
>> CPython.
>
> Well, there are competing implementations of CPython these days, i.e. Unladen
> Swallow, which is experimenting with garbage collection...

Sure, but US is going to break a lot more than sys.getrefcount() :).

...

> Anyway, I'm ok with asserting things about refcounts, just so long as it's kept
> to a minimum :)

So there is a bit of a trade-off. The code works in terms of refcounts
(Py_INCREF, Py_DECREF), and it seemed most logical to *test* it in
those terms. There is also a reasonable desire to make sure that we
don't get segfaults (too many decrefs) or memory leaks (too many
increfs). I certainly didn't pepper all the tests with this; hopefully
they are focused enough that when we port Bazaar to IronPython it will
be obvious what needs to be fixed. :)
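(The style of assertion being described, sketched against a builtin set — CPython-only, since `sys.getrefcount` is itself CPython-specific and counts its own argument too:)

```python
import sys

def assert_balanced_refcounts():
    """Check that insert/discard leaves an object's refcount unchanged,
    i.e. no leaked INCREF and no stray DECREF."""
    obj = object()
    before = sys.getrefcount(obj)
    s = set()
    s.add(obj)
    # The container must hold exactly one new reference...
    assert sys.getrefcount(obj) == before + 1
    s.discard(obj)
    # ...and give it back on removal.
    assert sys.getrefcount(obj) == before

assert_balanced_refcounts()
```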

John
=:->
